METHOD AND SYSTEM FOR OPTIMAL DATA DIAGNOSIS 

FIELD OF THE INVENTION 

The present invention relates to a technique for computing optimal diagnosis 
for inference problems associated with various applications; and more particularly, 
the present invention relates to diagnostic inference problems associated with 
various applications such as fault diagnosis of manufacturing lines for any product, 
analyzing credit risk for banks, mortgage and credit card companies, analyzing 
potential insurance fraud, analyzing bank accounts for illicit activities such as 
money laundering, illegal international transfers, etc. 

More particularly, the present invention pertains to a method and system for 
finding an optimal set of data association rules in automated data diagnosis of the 
data characterizing an entity. 

In overall concept, the present invention relates to a method of analyzing 
relational data bases associated with numerous manufacturing, financial, medical, 
and other applications, where a relational data base includes various measured 
parameters of the application that are used to determine whether the results of the 
application are desirable or not, and where conditions on the measurements are 
inferred which separate the desirable results from the undesirable ones. 

BACKGROUND OF THE INVENTION 

Various applications exist wherein a plurality of aspects of a process are 
measured and it is determined whether the result of the process is desirable or not. 
Such applications may include fault diagnosis of manufacturing lines for any product, 
analyzing credit risk for banks, mortgage, and credit card companies, analyzing 
potential insurance fraud, analyzing bank accounts for illicit activity such as money 
laundering, illegal international transfers, etc. In such applications, it is useful to 
infer conditions on the measurements (or other parameters of the application) that 
separate out the desirable results from the undesirable ones. These kinds of 
problems are called diagnostic inference problems. The ability to perform this 
inference often results in corrective actions that increase the probability of obtaining 
a desirable result. 

In numerous applications, relational data bases record information about a 
domain. A relational data base, as known to those skilled in the art, is a data base 
which stores all its data within tables. All operations on data are conducted in the 
tables themselves or, alternatively, produce a resulting table. Each such table is a 
set of rows and columns, as described in J.D. Ullman, Principles of Data Base and 
Knowledge Base Systems, Computer Science Press, 1989. 

In the relational data bases associated with applications, the data base tuples, 
which are the rows of the data base, are used to record information about a particular 
entity, while the columns of the data base represent the attributes, which are 
specific parameters of the analyzed entity. In order to separate out the rows with 
desirable results from the rows with undesirable results, specific association rules 
are established which are applied to the relational data base of the application domain. 
The association rules are rules presentable in the form C1 → C2, where C1 and C2 
are conditions that are used to determine whether the entity is desirable and where 
the condition C2 is not necessarily fixed. 

The problem of association rule mining was first introduced in R. Agrawal, 
T. Imielinski, A. Swami, "Mining Association Rules Between Sets of Items in Large 
Databases", In Proc. of ACM SIGMOD, 1993, pp. 207-216. The work was limited 
to non-numeric data, and it found all association rules that exceed specified 
criteria such as lower bounds for support and confidence. 

The body of work on association rules is extensive, and many aspects of the 
issue have been developed over the years. For example, in R. Srikant and R. 
Agrawal, "Mining Quantitative Association Rules in Large Relational Tables", In 
Proc. of ACM SIGMOD, pp. 1-12, 1996, a framework was introduced which was 
designed to find association rules in data sets that include numeric attributes. The 
authors present concepts of K-completeness and interest, which are used to reduce 
the number of rules that need to be considered explicitly and to eliminate redundant 
rules. 

The primary weakness of the framework, however, is that, as with R. 
Agrawal, et al. (supra), the framework relies just on the support and confidence 
lower bounds to determine which rules to select. Relying simply on support and 
confidence lower bounds poses the problem of an excessive number of rules being 
returned for analysis if the bounds are set too low, as well as the possibility of not 
returning a rule of interest if the bounds are set too high. 

A further critique of frameworks that rely just on support and confidence 
lower bounds has been presented in S. Brin, R. Motwani, and C. Silverstein, 
"Beyond Market Baskets: Generalizing Association Rules to Correlations", In Proc. 
of ACM SIGMOD, pp. 265-276, 1997, as well as in R. Srikant and R. Agrawal, 
"Mining Quantitative Association Rules in Large Relational Tables", In Proc. of 
ACM SIGMOD, pp. 1-12, 1996. However, these papers do not address the issue of 
the simplicity of rules. 

The paper of R.J. Bayardo, R. Agrawal, D. Gunopulos, "Constraint-based 
Rule Mining in Large, Dense Databases", In Proc. of ICDE, pp. 188-197, 1999, 
presents a framework that addresses the issue of rule simplicity. The paper proposes 
a notion of a rule improvement constraint, in which a more complicated rule is not 
returned if its improvement over a simpler rule is small. However, the framework 
only applies to non-numeric data, and again relies heavily on support and confidence 
lower bounds. 

Two other frameworks of note, presented in Y. Aumann and Y. Lindell, "A 
Statistical Theory for Quantitative Association Rules", In Proc. of ACM SIGKDD, 
pp. 261-270, 1999, and R.J. Miller and Y. Yang, "Association Rules Over Interval 
Data", In Proc. of ACM SIGMOD, pp. 452-461, 1997, address finding association 
rules for numeric data. Neither framework uses the traditional definitions of support 
and confidence of rules, but both frameworks rely heavily on constraints to determine 
which rules to return. 

Another approach to association rule mining is to find those rules that are 
optimal or near optimal according to some criteria. Representative papers in this 
area include S. Brin, S.R. Rastogi, and K. Shim, "Mining Optimized Gain Rules for 
Numeric Attributes", In Proc. of ACM SIGKDD, pp. 135-144, 1999; T. Fukuda, Y. 
Morimoto, S. Morishita, and T. Tokuyama, "Data Mining Using Two-Dimensional 
Optimized Association Rules: Scheme, Algorithms, and Visualization", In Proc. of 
ACM SIGMOD, pp. 13-23, 1996; and R. Rastogi, K. Shim, "Mining Optimized 
Support Rules for Numeric Attributes", In Proc. of ICDE, pp. 126-135, 1999. 
These papers study ways to efficiently find optimal association rules according to 
measures such as gain, support, and confidence in certain restricted settings. 

Another paper dealing with optimal association rule mining presents a partial 
ordering for association rules based on support and confidence. This framework is, 
however, disadvantageous in that it fails to consider the simplicity of conditions or to 
remove redundant rules. In addition, the framework is limited to non-numeric 
attributes. 

It would therefore be highly desirable to have a technique for optimal 
association rule mining which is applicable to both numeric and non-numeric 
attributes, which considers the simplicity of conditions in addition to support and 
confidence, and which optimizes efficiency by removing redundant rules. It would 
also be highly desirable for this technique to involve mining not just one rule at a 
time, but a set of k rules for some number k. 






SUMMARY OF THE INVENTION 

It is therefore an object of the present invention to provide a technique for 
separating in the most effective and least time consuming manner the desirable 
conditions from the undesirable conditions in relational databases associated with 
applications in many areas of human activity. 

It is another object of the present invention to provide a method and system 
for automated data diagnosis associated with relational databases, where a plurality 
of criteria are taken into consideration for the separation of the desirable and 
undesirable conditions in order to find near optimal conditions in addition to optimal 
conditions, and wherein the redundant rules are removed during the process of 
automated data diagnosis. 

It is a further object of the present invention to provide a method and system 
for automated data diagnosis in which the simplicity of conditions is considered 
along with support and confidence of the conditions. 

It is still another object of the present invention to provide a method which is 
applied both to numeric and non-numeric attributes in relational databases for 
computing the optimal association rules. 

According to the teachings of the present invention, a method for automated 
data diagnosis is provided which results in an optimal set of association rules for 
data characterizing an entity. The method includes the steps of: 

establishing a computer system for automated data diagnosis; and 

creating a relation R containing the data A = (A_1, ..., A_n, A_{n+1}, ..., A_{n+m}), 
where n, m ≥ 1. The data characterize the entity to be diagnosed, which may be a 
product or a process in basically any application of human activity. 

The data A is represented by outcome attributes A_{n+1}, ..., A_{n+m} and diagnosable 
attributes A_1, ..., A_n, where the outcome attributes determine whether the entity is 
desirable or not, and the diagnosable attributes determine the reason why the 
entity is desirable or not. The outcome and the diagnosable attributes in the relation 
R may be numeric as well as non-numeric attributes. 

Further, a user specifies to the computer system an outcome condition (D), 
which is a selection condition that includes strictly outcome attributes selected 
from the A_{n+1}, ..., A_{n+m} attributes, and specifies at least one diagnosable selection 
condition C, which includes diagnosable attributes selected by the user from the 
A_1, ..., A_n attributes. 

The user also specifies a "simpler-than" ordering (>_simpler) criterion, which 
includes a set of the diagnosable selection conditions C which are simpler than a 
predetermined diagnosable selection condition. Additionally, the user specifies to 
the computer system a Data Diagnosis Objective (DDO) by defining the following 
components of the DDO: 

(a) an evaluation domain (ED) providing for measurement of quality of 
the diagnosable selection conditions, 

(b) a partial ordering (⊑) of the evaluation domain, specifying which 
diagnosable selection conditions in the evaluation domain are better than others, and 

(c) a mapping function f that maps the diagnosable selection 
conditions to the evaluation domain, such that C_1 > C_2 ⟹ f(C_2) ⊑ f(C_1), wherein 
C_1 and C_2 are diagnosable selection conditions. 

Arbitrary metrics may be applied to the DDO for comparing diagnosable selection 
conditions. Metrics that can be used include standard ones such as chi-squared 
value, confidence, conviction, entropy gain, laplace, lift, gain, gini, and support. 
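
As a rough illustration of how such metrics might be evaluated for a single 
diagnosable selection condition C and outcome condition D, the following Python 
sketch computes support, confidence, and lift from tuple counts. The function name 
rule_metrics, the count-based inputs, and the particular normalizations (for 
example, support taken as a fraction of all tuples) are assumptions made for this 
illustration; they follow common textbook forms and are not necessarily the 
definitions adopted by the invention. 

```python
def rule_metrics(n_total, n_cond, n_outcome, n_both):
    """Return common association-rule metrics for a rule C -> D.

    n_total   -- number of tuples in the relation R
    n_cond    -- number of tuples satisfying the condition C
    n_outcome -- number of tuples satisfying the outcome condition D
    n_both    -- number of tuples satisfying both C and D
    """
    if n_total <= 0 or n_cond <= 0 or n_outcome <= 0:
        raise ValueError("counts must be positive")
    support = n_both / n_total                 # fraction of R satisfying C and D
    confidence = n_both / n_cond               # fraction of C-tuples satisfying D
    lift = confidence / (n_outcome / n_total)  # confidence relative to base rate of D
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical counts: 9 tuples, 3 satisfy C, 3 satisfy D, 2 satisfy both.
metrics = rule_metrics(n_total=9, n_cond=3, n_outcome=3, n_both=2)
```

Any function of this kind could play the role of the mapping f, provided the 
evaluation domain it maps into is given a suitable partial ordering. 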

The method further contemplates the steps of: 

specifying a semi-equivalence relation ~ on the diagnosable selection 
conditions to determine similarity thereof; 

specifying selection condition constraints S for the diagnosable 
selection conditions to meet, where the selection condition constraints include 
minimal acceptable confidence, minimal acceptable support and maximum order of 
said diagnosable selection condition; 

specifying to the computer system a number of fringes of interest, F^0, 
F^1, ..., wherein the fringe F^0 represents the best possible selection conditions with 
regard to a combination of the specified support, the specified confidence and the 
specified "simpler-than" ordering, the fringe F^1 represents the set of diagnosable 
selection conditions that are worse only than the fringe F^0, and the fringe F^{i+1} 
represents the set of diagnosable selection conditions that are worse than the fringe 
F^i; 

computing said optimal fringes F^0, F^1, ..., F^i, F^{i+1}; and 

computing a compact set of the optimal fringes to eliminate redundant 
conditions, said compact set representing the optimal set of data association rules. 
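
One way to picture the fringes is as successive layers of non-dominated 
conditions. The Python sketch below is only an illustration under stated 
assumptions: each condition is scored by a hypothetical (support, confidence) pair, 
"better" means at least as good in both components and different in at least one, 
and the "simpler-than" ordering is left out for brevity. The names dominates and 
fringes are introduced here and do not come from the source. 

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and differs from b."""
    return all(x >= y for x, y in zip(a, b)) and a != b


def fringes(scores, k):
    """Peel off up to k successive fringes from a dict {condition_name: score_tuple}."""
    remaining = dict(scores)
    result = []
    while remaining and len(result) < k:
        # A condition is on the current fringe if no remaining condition dominates it.
        fringe = {
            name for name, s in remaining.items()
            if not any(dominates(t, s) for t in remaining.values())
        }
        result.append(fringe)
        for name in fringe:
            del remaining[name]
    return result


# Hypothetical (support, confidence) scores for four conditions.
scores = {"c1": (0.30, 0.90), "c2": (0.50, 0.60), "c3": (0.25, 0.80), "c4": (0.10, 0.50)}
layers = fringes(scores, k=3)
```

With these scores, c1 and c2 are mutually incomparable and form the first layer, 
c3 is dominated only by c1, and c4 is dominated by c3, so three layers result. 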

A subset CF of said set F of the optimal fringes is defined as the compact 
representation of the set F with regard to the semi-equivalence relation ~ (with ⊑ 
denoting the partial ordering of the evaluation domain) if: 

a. for each diagnosable selection condition sc ∈ F, ∃ sc' ∈ CF such that 
sc ~ sc'; 

b. if sc ∈ F and sc ∉ CF, then ∃ sc' ∈ CF such that 
sc ~ sc' ∧ (¬(f(sc') ⊑ f(sc)) ∨ (f(sc') = f(sc))); and 

c. there is no strict subset CF' of CF satisfying said conditions (a) 
and (b). 

The diagnosable conditions C are combined to form a diagnosable selection 
condition SC; and the diagnosable selection conditions SC are restricted to tight 
diagnosable selection conditions T, wherein the diagnosable selection condition SC 
is tight if, for each diagnosable condition l ≤ A_i ≤ u in the diagnosable selection 
condition SC, 

(σ_{(A_i = l) ∧ SC ∧ D}(R) ≠ ∅) ∧ (σ_{(A_i = u) ∧ SC ∧ D}(R) ≠ ∅), 

where σ is a relational selection operator, A_i is a diagnosable attribute, and 
l, u ∈ dom(A_i) are values defined by the user. 

Preferably, for computing the optimal fringes, a condition graph is created by 
enumerating a set of tight selection conditions T satisfying the selection condition 
constraints, and support and confidence defined by the user are evaluated. 

In the method of the present invention, the fringes are defined independent of 
the DDO. A plurality of distinct DDOs can be applied to the set of fringes without 
recomputing the set of fringes. 

The semi-equivalence relation ~ may be a distance-based relation, an 
attribute distance threshold based semi-equivalence relation, or another relation. The 
semi-equivalence relation is specified by defining L_{C,A} and U_{C,A} to be the lower and 
upper bounds, respectively, of a respective diagnosable attribute A in said 
diagnosable condition C, and defining a diagnosable selection condition C_1 as 
semi-equivalent to a diagnosable selection condition C_2 if: 

a. the set of diagnosable attributes appearing in C_1 is equivalent to 
the set of diagnosable attributes appearing in C_2; 

b. for each numeric diagnosable attribute A appearing in C_1 and C_2, 

d(L_{C_1,A}, L_{C_2,A}) ≤ ε_A 
d(U_{C_1,A}, U_{C_2,A}) ≤ ε_A 

where the ε_A values are constants that differ based on the 
diagnosable attribute A; and/or 

c. for each non-numeric attribute A appearing in C_1 and C_2, 
L_{C_1,A} = L_{C_2,A}. 
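
The three clauses above can be sketched as a short Python check. This is a 
minimal illustration under several assumptions that are not taken from the source: 
a condition is represented as a dictionary mapping an attribute name to a (lower, 
upper) pair, the distance d is the absolute difference, the threshold comparison is 
non-strict, and all three clauses are required together. The function name 
semi_equivalent and its parameters are hypothetical. 

```python
def semi_equivalent(c1, c2, eps, numeric):
    """Check clauses (a)-(c) for two conditions.

    c1, c2  -- dicts mapping attribute name -> (lower, upper)
    eps     -- dict mapping numeric attribute name -> threshold constant
    numeric -- set of attribute names treated as numeric
    """
    # (a) both conditions must mention exactly the same attributes
    if set(c1) != set(c2):
        return False
    for attr in c1:
        lo1, up1 = c1[attr]
        lo2, up2 = c2[attr]
        if attr in numeric:
            # (b) lower and upper bounds must each lie within the threshold
            if abs(lo1 - lo2) > eps[attr] or abs(up1 - up2) > eps[attr]:
                return False
        elif lo1 != lo2:
            # (c) non-numeric attributes must carry the same value
            return False
    return True


# Hypothetical conditions over the attributes age (numeric) and sex (non-numeric).
a = {"age": (26, 75), "sex": ("Male", "Male")}
b = {"age": (28, 74), "sex": ("Male", "Male")}
c = {"age": (37, 59), "sex": ("Male", "Male")}
close = semi_equivalent(a, b, eps={"age": 5}, numeric={"age"})
far = semi_equivalent(a, c, eps={"age": 5}, numeric={"age"})
```

Here a and b differ by at most 2 in each age bound and so count as similar, 
whereas a and c differ by 11 in the lower bound and do not. 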

Further, the present invention is directed to a system for automated data 
diagnosis which includes: 

a computer; 

means in said computer system for storing data to be diagnosed, the 
data characterizing an entity; 

means for forming a relation R containing the data to be diagnosed; 

means for creating a relation R containing the data characterizing the 
entity, A = (A_1, ..., A_n, A_{n+1}, ..., A_{n+m}), where n, m ≥ 1 (the data are represented 
by outcome attributes A_{n+1}, ..., A_{n+m} and diagnosable attributes A_1, ..., A_n, the 
outcome attributes determining whether the entity is desirable or not, and the 
diagnosable attributes determining the reason why the entity is desirable or not); 

an interface for communication between a user and the computer, the 
user inputting into said computer a plurality of selective conditions through the 
interface; and 

means in the computer for computing optimal data association rules 
for the data to be diagnosed based on the selective conditions, wherein a 
combination of a confidence, support and simplicity of the selective conditions is 
considered. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a schematic block-diagram of the system for automated data 
diagnosis of the present invention; 

Fig. 2 is a flow-chart diagram of the method for automated data diagnosis of 
the present invention; 

Fig. 3 is a flow-chart diagram of the process of building the condition graph 
(COG) of the present invention; 

Fig. 4 is a diagram illustrating Execution Time of the procedure of the 
present invention vs. Order Constraint; 

Fig. 5 is a diagram illustrating Execution Time of the procedure of the 
present invention vs. Number of Attributes; 

Fig. 6 is a diagram illustrating Execution Time of the procedure of the 
present invention vs. the percent of tuples that satisfy the outcome condition; 

Fig. 7 is a diagram illustrating Execution Time of the process of the present 
invention vs. Number of Tuples in the Data Set; and, 

Fig. 8 is a diagram representing Execution Time of the process of the present 
invention vs. Number of Fringes. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Relational databases are used to record information about a domain in 
numerous applications. In the further description of the present invention, three 
application examples are used for the sake of clarity and ease of understanding of 
the principles of the present invention: a Manufacturing Example, a Loan Default 
Prediction Example, and a Census Example. However, as will be understood by those 
skilled in the art, these examples are not intended to limit the scope of the method 
and system of the present invention but are used for clarification purposes only. 
Other applications in various fields of endeavor may be considered as well. 

In different applications, including those used as examples and described 
infra, the relational database tuples, which are the rows of the database, are used to 
record information about a particular entity (for example, a product of a process). In 
the technique of the present invention, the attributes, which are the columns of the 
database, are split into two parts: outcome attributes and diagnosed attributes. 
Outcome attributes are the attributes that may be used to determine whether an entity 
is desirable or not, while diagnosed attributes are used to determine why some 
entities are desirable while others are not. 

1.1 Manufacturing Example 

In the Manufacturing example, a manufactured product P represents an entity 
considered for data diagnosis. P may be something as simple as a coffee mug or as 
complex as an LCD device (or other device). For each item of the product P 
manufactured, sensors may record various aspects of the product as it moves along 
the production line from the raw materials stage to the finished product (or part 
thereof). Each of these sensor readings, together with an "ItemId" attribute, may 
correspond to a diagnosed attribute. At the end of the manufacturing process, quality 
control inspectors inspect the resulting product and assign it either 0 (i.e., passed 
inspection) or a defect code number (specifying a certain type of defect). This 
finding can then be represented in a table as the outcome attribute "Inspection". An 
example table of this kind is represented as Table 1. 

TABLE 1 

ItemId   Sensor1   Sensor2   Sensor3   Sensor4   Inspection 
1        10        5         32        9         0 
2        9         6         30        8         0 
3        34        5         45        9         22 
4        9         6         31        8         0 
5        11        6         32        8         0 
6        10        7         31        9         0 
7        36        6         46        8         14 
8        9         6         32        9         0 
9        11        7         33        8         0 



In this example, the schema has only one outcome attribute, viz. Inspection, while 
all the other attributes (readings of sensors 1-4) are diagnosable attributes. 
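
The split of the schema into diagnosable and outcome attributes can be 
sketched as follows. This is a minimal Python illustration using the rows of Table 
1; the constant names and the helper split_tuple are hypothetical and were chosen 
only for this example. 

```python
ATTRIBUTES = ("ItemId", "Sensor1", "Sensor2", "Sensor3", "Sensor4", "Inspection")
DIAGNOSABLE = ("Sensor1", "Sensor2", "Sensor3", "Sensor4")  # sensor readings
OUTCOME = ("Inspection",)                                     # inspection result

# Rows of Table 1, in schema order.
TABLE_1 = [
    (1, 10, 5, 32, 9, 0),
    (2, 9, 6, 30, 8, 0),
    (3, 34, 5, 45, 9, 22),
    (4, 9, 6, 31, 8, 0),
    (5, 11, 6, 32, 8, 0),
    (6, 10, 7, 31, 9, 0),
    (7, 36, 6, 46, 8, 14),
    (8, 9, 6, 32, 9, 0),
    (9, 11, 7, 33, 8, 0),
]


def split_tuple(row):
    """Return (diagnosable part, outcome part) of one tuple as dicts."""
    record = dict(zip(ATTRIBUTES, row))
    return ({a: record[a] for a in DIAGNOSABLE},
            {a: record[a] for a in OUTCOME})


diagnosable, outcome = split_tuple(TABLE_1[2])  # the tuple with ItemId 3
```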

1.2 Loan Default Prediction 

In the Loan Default Prediction example, a bank that is interested in 
determining what kinds of loans will default is considered for data diagnosis. Such a 
bank may use various parameters in measuring data, but for the purposes of an 
expository example, only a few are listed in Table 2. 

TABLE 2 

Name               YPresAdd   YPresEmpl   YPrevEmpl   Income   LPay   Age   Dep   Default 
Jennifer Brown     1          2           1           7900     600    27    0     0 
Jim Burk           4          4           5           6700     1500   39    1     0 
Brian Davis        12         9           9           5700     1500   44    1     0 
John Doe           8          10          6           7500     2000   41    2     0 
Lisa Johnson       9          7           6           8700     3000   48    2     0 
Mike Jones         7          9           5           7700     100    43    3     0 
Jane Shady         3          1           1           2000     1500   37    1     1 
Dan Smith          6          2           2           3000     250    32    1     1 
Melissa Williams   11         5           7           4500     750    42    2     1 



The columns in Table 2 stand for name, years at present address, years at 
present employment, years at previous employment, monthly income, monthly loan 
payment, age, number of dependents, and whether or not the individual defaulted (1 
indicates they did default, 0 indicates they did not). 

Over the years, the bank may have accumulated a large set of data 
concerning loans they have made. Each such loan account may be described as a 
row in a table such as Table 2. In such an application, conditions on the columns 
(ignoring the Name column) are to be found that separate out loans that will default 
from ones which will not, as this clearly has an impact on the bank's planning. In this 
application, all attributes, except for the Name and Default columns, will be used as 
diagnosable attributes. 

These are just two examples - in different markets - of how diagnostic 
inference can be extremely important to corporations. The Manufacturing example 
applies to virtually all manufactured products, ranging from paper cups to 
sophisticated airplane electronics. Likewise, the Loan Default Prediction example 
may be applied to a variety of other financial applications such as credit card default 
and commercial bankruptcy predictions. 

The following example presented infra herein is based on a real data set. The 
data comes from the adult data set, also known as the census income data set, on the 
UCI machine learning repository, C.L. Blake and C.J. Merz. UCI Repository of 
Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html . 
Irvine, CA: University of California, Department of Information and Computer 
Science, 1998. The data consists of 32,561 tuples. Each tuple is broken down into 1 
outcome attribute and 11 diagnosed attributes. Of the 11 diagnosed attributes, 5 are 
numeric and 6 are non-numeric. 

1.3 Census 

In the Census example, a set of data is considered for the diagnosis, which 
contains a Boolean outcome attribute, "Income", which indicates whether or not an 
individual makes more than $50,000 a year, and the following 11 diagnosed 
attributes, as shown in Table 3. 

TABLE 3 

age: numeric - [17, 90] 

workclass: non-numeric - {Federal-gov, Local-gov, Never-worked, Private, 
Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay} 

education: numeric - {1 (Preschool), 2 (1st-4th), 3 (5th-6th), 4 (7th-8th), 5 (9th), 
6 (10th), 7 (11th), 8 (12th), 9 (HS-grad), 10 (Some-college), 11 (Assoc-voc), 
12 (Assoc-acdm), 13 (Bachelors), 14 (Masters), 15 (Prof-school), 16 (Doctorate)} 

marital-status: non-numeric - {Divorced, Married-AF-spouse, Married-civ-spouse, 
Married-spouse-absent, Never-married, Separated, Widowed} 

occupation: non-numeric - {Adm-clerical, Armed-Forces, Craft-repair, 
Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, 
Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, 
Transport-moving} 

relationship: non-numeric - {Husband, Not-in-family, Other-relative, Own-child, 
Unmarried, Wife} 

sex: non-numeric - {Female, Male} 

capital-gain: numeric - [0, 99999] 

capital-loss: numeric - [0, 4356] 

hours-per-week: numeric - [1, 99] 

native-country: non-numeric - {Cambodia, Canada, China, Columbia, Cuba, 
Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, 
Guatemala, Haiti, Holland-Netherlands, Honduras, Hong, Hungary, India, Iran, 
Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US (Guam-USVI, 
etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, 
Thailand, Trinadad & Tobago, United States, Vietnam, Yugoslavia} 

In Table 3, age contains the individual's age; workclass contains the 
census code for the class of workers to which the individual belongs; education 
contains the highest level of schooling an individual has obtained on a scale of 1 to 
16; occupation contains the census code for the occupation of the individual; 
capital-gain and capital-loss contain the capital gains and losses the individual 
incurred during a year period respectively; hours-per-week contains the number of 
hours an individual works on average; native-country contains the native country of 
the individual. 

An example of a goal may be to find a set of conditions to understand what 
types of people earn more than $50,000 a year. The following is one such possible 
set of conditions that can be generated via the framework for diagnostic inference of 
the present invention applied to the census data set: {7688 ≤ capital-gain ≤ 20051, 
3103 ≤ capital-gain ≤ 9999, workclass = Self-emp-inc, occupation = 
Exec-managerial, 12 ≤ education-num ≤ 16, relationship = Husband, marital-status = 
Married-civ-spouse, 26 ≤ age ≤ 75, race = White, native-country = United States, 
2 ≤ education-num ≤ 16, relationship = Wife, occupation = Prof-specialty, 
37 ≤ age ≤ 59, sex = Male}. 

For example, the inclusion of 12 ≤ education-num ≤ 16 in the set can be 
interpreted to mean that a partial explanation as to why someone is able to earn more 
than $50,000 a year is the fact that they hold an advanced degree. One should 
recognize that the number of conditions that may be included in the set is enormous. 
The method of the present invention permits narrowing the set of conditions to the 
fifteen that appear supra. 

The process of narrowing the enormous space of conditions is the core of 
automated data diagnosis of the present invention, to provide the efficient and the 
least time consuming computation of the diagnostic inference problem. 

It is important to note that the method of the present invention applies to any 
relation, R, where the attributes can be split into a set of diagnosed attributes and a 
set of outcome attributes. The goal of the process is to specify and compute an 
optimal set of conditions on the diagnosable attributes of R for an outcome condition 
D. In the framework of the present invention, a selection condition will be more 
likely to appear in the optimal set if: 

• the accuracy of the condition is high; 

• the number of tuples that supports the validity of the condition is high; 

• the condition is relatively simple; and 

• the condition is not similar to other "desirable" conditions. 

The framework of the present invention for diagnostic inference has been 
designed to meet the following criteria: 

1. The framework should handle numeric as well as non-numeric attributes; 

2. By varying a parameter a user should be able to vary the number of 
conditions returned; 

3. Any condition that is not returned, but satisfies all constraints, must be 
very similar to a condition that is returned, or alternatively strictly worse than a 
specified number of other conditions according to any reasonable objective measure 
of the "goodness" of conditions (including measures that take into account the 
simplicity of conditions); 

4. The set of conditions returned should not have many more conditions 
beyond that which is required to ensure the previous item; and 

5. No two conditions which are essentially the same should be returned. 

The framework of the present invention achieves all five of the above 
criteria. 

In the following description, the definitions presented infra are used. 

It is assumed that there is a relation R over schema A = 
(A_1, ..., A_{n+m}), where n, m ≥ 1, attributes A_1, ..., A_n are the diagnosed 
attributes, and A_{n+1}, ..., A_{n+m} are the outcome attributes. Each attribute A_i has an 
associated domain, dom(A_i). There is no loss of generality in making these 
assumptions. 

Definition 2.1 (outcome condition (OC)) Suppose R is a relation as 
assumed supra. An outcome condition D is any selection condition that only 
involves outcome attributes. 

Example 2.1 (manufacturing example 1.1 revisited) Returning to the 
Manufacturing example 1.1, two possible outcome conditions that one may want to 
use are Inspection = 14 and Inspection ≠ 0. 

Example 2.2 (loan default prediction example 1.2 revisited) Returning to 
the loan default prediction example 1.2, a possible outcome condition that one may 
want to use is Default = 1. 

Example 2.3 (census example 1.3 revisited) Returning to the census 
example 1.3, the outcome condition used is Income=true, where Income=true 
means an individual earns more than $50,000 a year. 
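
The outcome conditions of Examples 2.1 and 2.2 can be sketched as simple 
predicates over the outcome attribute. The Python snippet below is only an 
illustration: it keeps just the key and outcome columns of Tables 1 and 2, and the 
helper name select is hypothetical. 

```python
# (ItemId, Inspection) pairs from Table 1 and (Name, Default) pairs from Table 2.
INSPECTIONS = [(1, 0), (2, 0), (3, 22), (4, 0), (5, 0), (6, 0), (7, 14), (8, 0), (9, 0)]
DEFAULTS = [("Jennifer Brown", 0), ("Jim Burk", 0), ("Brian Davis", 0),
            ("John Doe", 0), ("Lisa Johnson", 0), ("Mike Jones", 0),
            ("Jane Shady", 1), ("Dan Smith", 1), ("Melissa Williams", 1)]


def select(rows, outcome_condition):
    """Return the keys of the rows whose outcome value satisfies the condition."""
    return [key for key, outcome in rows if outcome_condition(outcome)]


specific_defect = select(INSPECTIONS, lambda v: v == 14)  # Inspection = 14
any_defect = select(INSPECTIONS, lambda v: v != 0)        # Inspection differs from 0
defaulted = select(DEFAULTS, lambda v: v == 1)            # Default = 1
```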

Definition 2.2 (atomic diagnosable condition) If A_i is a diagnosable 
attribute and l, u ∈ dom(A_i) are values, then l ≤ A_i ≤ u is called an atomic 
diagnosable condition. If l = u, then we will also use A_i = l to denote l ≤ A_i ≤ u. 

All attributes can be either numeric or non-numeric; if A_i in the above 
definition is a non-numeric attribute, then it must be the case that l = u. 

Example 2.4 (manufacturing example 1.1 revisited) In the manufacturing 
example 1.1, 34 ≤ Sensor1 ≤ 36 and Sensor2 = 6 are both examples of atomic 
diagnosable conditions. 






Example 2.5 (loan default prediction 1.2 revisited) In the loan default 
prediction example 1.2, 4500 ≤ Income ≤ 4700 provides an example of an atomic 
diagnosable condition. 

Example 2.6 (census example 1.3 revisited) In the census example 1.3 
occupation = Exec-managerial provides an example of an atomic diagnosable 
condition. 
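
An atomic diagnosable condition can be represented as a small value object 
holding an attribute name and its two bounds. The following Python sketch is one 
possible representation, not the data structure of the invention; the class name 
AtomicCondition and its methods are hypothetical, and bounds are treated as 
inclusive, following the convention that equal bounds denote an equality. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicCondition:
    attribute: str
    lower: object
    upper: object

    def holds(self, row):
        """True if the row's value for the attribute lies within the inclusive bounds."""
        return self.lower <= row[self.attribute] <= self.upper

    def __str__(self):
        # Equal bounds are written as an equality, as in Definition 2.2.
        if self.lower == self.upper:
            return f"{self.attribute} = {self.lower}"
        return f"{self.lower} <= {self.attribute} <= {self.upper}"


sensor1 = AtomicCondition("Sensor1", 34, 36)
sensor2 = AtomicCondition("Sensor2", 6, 6)
occupation = AtomicCondition("occupation", "Exec-managerial", "Exec-managerial")

item7 = {"Sensor1": 36, "Sensor2": 6}  # sensor readings of ItemId 7 in Table 1
item1 = {"Sensor1": 10, "Sensor2": 5}  # sensor readings of ItemId 1 in Table 1
```

For a non-numeric attribute the two bounds coincide, so the inclusive 
comparison reduces to an equality test on the attribute value. 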

Diagnosable conditions can be combined to form a diagnosable selection 
condition which is defined infra. 

Definition 2.3 (diagnosable selection condition) 

1. Every atomic diagnosable condition is a diagnosable selection condition. 

2. If sc_1, sc_2 are diagnosable selection conditions, then sc_1 ∧ sc_2 is a 
diagnosable selection condition. 

In the method of the present invention, the attention is restricted to a special 
type of diagnosable selection conditions, which is called tight diagnosable selection 
conditions. 

Definition 2.4 (tight diagnosable selection condition) Suppose R is a relation 
instance over schema A and D is an OC. A diagnosable selection condition sc is 
tight iff for each atomic diagnosable condition l ≤ A_i ≤ u in sc, 

(σ_{(A_i = l) ∧ sc ∧ D}(R) ≠ ∅) ∧ (σ_{(A_i = u) ∧ sc ∧ D}(R) ≠ ∅), 

where σ is the standard relational selection operator (J.D. Ullman, Principles of 
Database and Knowledge Base Systems, Computer Science Press, 1989). 

Intuitively, a tight diagnosable selection condition is a diagnosable selection 
condition in which a tuple satisfying the outcome condition occurs on every 
boundary of the region described by the diagnosable selection condition. 
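
The tightness test of Definition 2.4 can be sketched in Python as follows. This is only an illustration: the dictionary representation of a condition, the function names and the three sample rows are assumptions, and the rows are invented rather than copied from Table 2.

```python
def select(rows, sc, outcome=None):
    """Relational selection: rows satisfying sc (and the outcome predicate, if given)."""
    return [
        r for r in rows
        if all(lo <= r[a] <= hi for a, (lo, hi) in sc.items())
        and (outcome is None or outcome(r))
    ]


def is_tight(rows, sc, outcome):
    """Every bound of every atomic condition in sc must be attained by some
    tuple that satisfies both sc and the outcome condition."""
    hits = select(rows, sc, outcome)
    return all(
        any(r[a] == lo for r in hits) and any(r[a] == hi for r in hits)
        for a, (lo, hi) in sc.items()
    )


# Hypothetical loan-style rows, invented for illustration only.
rows = [
    {"Income": 3000, "Default": 1},
    {"Income": 4500, "Default": 1},
    {"Income": 5700, "Default": 0},
]


def defaulted(r):
    return r["Default"] == 1


tight = is_tight(rows, {"Income": (3000, 4500)}, defaulted)
not_tight = is_tight(rows, {"Income": (3000, 5700)}, defaulted)
```

On these invented rows the first condition is tight, while the second is not, because no defaulting tuple lies on the boundary Income = 5700.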

The following examples show some sample tight diagnosable selection 
conditions. 

Example 2.7 (manufacturing example 1.1 revisited) Sensor1 = 36 ∧ Sensor2 = 6 is
an example of a tight diagnosable selection condition for an OC of Inspection^ 4. 

Example 2.8 (loan default prediction example 1.2 revisited)
3000 ≤ Income ≤ 4500 is an example of a tight diagnosable selection condition for
an OC of Default = 1.

In the last example, 3000 ≤ Income ≤ 5700 would not have been a tight
diagnosable selection condition, since there is no tuple satisfying
Income = 5700 ∧ Default = 1 (Table 2).

Example 2.9 (census example 1.3 revisited) 7688 ≤ capital-gain ≤ 20051 is an
example of a tight diagnosable selection condition for an OC of Income=true.

It is noted that, given a condition C such that at least one tuple in R satisfies
the outcome condition D, there exists a tight selection condition C' pertaining to the
same attributes that C does such that the following desirable properties hold:

1. σ_{C ∧ D}(R) ⊆ σ_{C' ∧ D}(R)

These properties mean that for any condition C that is not tight, there will exist a
closely related tight condition C', which, as will be presented infra, is no worse than
C, and possibly better. By restricting the attention to tight conditions, the technique
will be able to drastically reduce the number of conditions that must be considered
explicitly, while not overlooking any conditions of interest. For example, in the
census income data set, there are 5,000,050,000 conditions for the attribute
capital-gain with integral lower and upper bounds in the range [0,99999], yet only 7,140
tight conditions.
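
The first count can be checked by hand: there are n(n+1)/2 pairs l ≤ u over n = 100,000 integer values. The Python sketch below reproduces that figure and enumerates the candidate tight conditions on a single attribute, assuming (as Definition 2.4 implies for a one-attribute condition) that both bounds must be values taken by tuples satisfying the outcome condition. The function names and the sample rows are illustrative assumptions.

```python
from itertools import combinations_with_replacement


def count_all_intervals(lo: int, hi: int) -> int:
    """Number of conditions l <= A <= u with integer bounds lo <= l <= u <= hi."""
    n = hi - lo + 1
    return n * (n + 1) // 2


def tight_intervals(rows, attr, outcome):
    """Single-attribute tight conditions: both bounds are values of `attr`
    observed on tuples that satisfy the outcome condition."""
    values = sorted({r[attr] for r in rows if outcome(r)})
    return list(combinations_with_replacement(values, 2))


all_count = count_all_intervals(0, 99999)

# Hypothetical rows, invented for illustration only.
rows = [
    {"capital-gain": 0, "Income": True},
    {"capital-gain": 7688, "Income": True},
    {"capital-gain": 20051, "Income": True},
    {"capital-gain": 5000, "Income": False},
]
candidates = tight_intervals(rows, "capital-gain", lambda r: r["Income"])
```

Here `all_count` equals 5,000,050,000. Under the stated assumption, k distinct outcome-satisfying values give k(k+1)/2 tight conditions; the quoted figure of 7,140 equals 119 × 120 / 2, which is what 119 distinct values would produce.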

Definition 2.5 (confidence of a selection condition) Suppose R is a relation 
instance over schema A, C is a tight diagnosable selection condition, and D is an 
OC. The confidence of C is defined as: 

conf(C) = card(σ_{C ∧ D}(R)) / card(σ_C(R))

Confidence of a tight diagnosable selection condition measures its accuracy: what
fraction of the tuples that satisfy the tight diagnosable selection condition also satisfy
the outcome condition. In the manufacturing example 1.1, conditions C on the sensor
readings are sought such that the proportion of tuples satisfying C that are defective
is relatively high when compared to the overall defect rate.

In the loan default prediction example 1.2, the conditions that correspond to
customers defaulting on their loans at high rates are of interest, that is, conditions
with high confidence when the OC is Default = 1. Confidence is one well-known measure of
the quality of a tight diagnosable selection condition. Another is support, defined in the
following paragraph.

Definition 2.6 (support) Suppose R is a relation instance over schema A, C is a
tight diagnosable selection condition and D is an OC. The support of C is defined
as:

sup(C) = card(σ_{C ∧ D}(R)).

It is important to note that support only measures the cardinality of the 
numerator of the formula used to compute confidence. For example, a condition can 
be found with conf(C) = 1 in which only one tuple satisfies C ∧ D. Such a condition
would have a support of just 1. Ideally, we would like to find conditions that have 
both high support and high confidence. However, in practice there is a trade-off 
between support and confidence. The rules with the highest confidence will 
generally have relatively little support. 
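
Definitions 2.5 and 2.6 translate directly into code. The following Python sketch is illustrative only: conditions and the outcome condition are modelled as predicates on a row, the sample rows are invented, and returning 0.0 when no tuple satisfies C is a convention chosen here, since the definition leaves that case open.

```python
def select(rows, pred):
    """Relational selection: the tuples of `rows` for which `pred` holds."""
    return [r for r in rows if pred(r)]


def support(rows, cond, outcome):
    """sup(C): number of tuples satisfying both C and the outcome condition D."""
    return len(select(rows, lambda r: cond(r) and outcome(r)))


def confidence(rows, cond, outcome):
    """conf(C): fraction of the tuples satisfying C that also satisfy D."""
    matched = len(select(rows, cond))
    # Assumed convention: confidence is 0.0 when no tuple satisfies C.
    return support(rows, cond, outcome) / matched if matched else 0.0


# Hypothetical sensor rows, invented for illustration only.
rows = [
    {"Sensor2": 6, "Inspection": 1},
    {"Sensor2": 6, "Inspection": 0},
    {"Sensor2": 7, "Inspection": 0},
    {"Sensor2": 7, "Inspection": 0},
    {"Sensor2": 6, "Inspection": 0},
    {"Sensor2": 9, "Inspection": 2},
]


def cond(r):
    return 6 <= r["Sensor2"] <= 7


def defective(r):
    return r["Inspection"] != 0


sup_c = support(rows, cond, defective)
conf_c = confidence(rows, cond, defective)
```

On these rows one of the five tuples satisfying the condition is defective, so the support is 1 and the confidence is 1/5, which illustrates how a condition can match many tuples and still have little support.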

Example 2.10 (manufacturing example 1.1 revisited) Suppose the OC, D, is
Inspection ≠ 0.

Suppose C_1 is Sensor1 = 36, then:
conf(C_1) = 1
sup(C_1) = 1

since the one tuple that satisfies C_1 also satisfies C_1 ∧ D.
Suppose C_2 is 6 ≤ Sensor2 ≤ 7 ∧ Sensor4 = 8, then:

conf(C_2) = 1/5

sup(C_2) = 1

since in this case, there is one tuple that supports the condition C_2 ∧ D out of a total
of five tuples satisfying condition C_2.

Example 2.11 (loan default prediction example 1.2 revisited) Suppose the OC,
D, is Default = 1.

Suppose that C_1 is Dep = 1, then:

conf(C_1) = 1/2

sup(C_1) = 2

since there are two tuples that support the condition C_1 ∧ D out of a total of four
tuples satisfying condition C_1.

Suppose that C_2 is 2000 ≤ Income ≤ 4500, then:

conf(C_2) = 1
sup(C_2) = 3

since all three tuples that support the condition C_2 also support the condition D.

Example 2.12 (census example 1.3 revisited) The support and confidence of
the fifteen conditions in the census example are listed in the following Table 4.

TABLE 4

Condition                             Confidence   Support
7688 ≤ capital-gain ≤ 20051           0.992        916
3103 ≤ capital-gain ≤ 99999           0.749        1677
Workclass=Self-emp-inc                0.557        622
Occupation=Exec-managerial            0.484        1968
12 ≤ education-num ≤ 16               0.457        4174
Relationship=Husband                  0.449        5918
Marital-status=Married-civ-spouse     0.447        6692
26 ≤ age ≤ 75                         0.297        7687
Race=White                            0.256        7117
Native-country=United States          0.246        7171
Relationship=Wife                     0.475        745
Occupation=Prof-specialty             0.449        1859
2 ≤ education-num ≤ 16                0.242        7841
37 ≤ age ≤ 59                         0.370        5221
Sex=Male                              0.306        6662

In comparison a condition that included the entire data set would have had a 
confidence value of 0.241 and a support value of 7,841. 

In some cases there are conditions that are of no interest, which motivates the
definition of a selection condition constraint.

Definition 2.7 (selection condition constraint) A selection condition constraint is
a Boolean predicate on selection conditions that all selection conditions returned
must satisfy.

Two common examples of selection condition constraints are presented infra. 
Many more examples are possible. 

Example 2.13 (order k restriction) A selection condition sc satisfies an order k 
restriction if the total number of attributes appearing in sc is k or less. 

Example 2.14 (manufacturing example 1.1 revisited) The condition
Sensor1 = 36 ∧ 5 ≤ Sensor2 ≤ 6 would satisfy an order 2 restriction, but not an order 1
restriction.

Example 2.15 (loan default prediction example 1.2 revisited) The condition
(Dep = 1) ∧ (3000 ≤ Income ≤ 5000) would also satisfy an order 2 restriction, but not
an order 1 restriction.

Example 2.16 (census example 1.3 revisited) All conditions in the set presented 
previously for the census data example satisfy an order 1 restriction. 

Example 2.17 (support-confidence lower bounds) The notation conf(C) ≥ p may be used
to denote the selection condition constraint that evaluates to true iff the confidence
of C is p or more. Similarly, sup(C) ≥ s may be used to denote the selection
condition constraint that evaluates to true iff the support of C is s or more. Here, p
and s are real numbers.

The selection condition constraint conf(C) ≥ 0.3 says that the confidence (precision) of the
selection condition must be 0.3 or more (for it to be acceptable). Likewise, the
selection condition constraint sup(C) ≥ 4 says the same w.r.t. support.
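
The constraints of Examples 2.13 and 2.17 can be expressed as Boolean predicates, in the spirit of Definition 2.7. The Python sketch below is a hedged illustration: the factory names and the calling convention (attribute set, confidence, support) are assumptions, while the two sets of figures are taken from Table 4.

```python
def order_at_most(k):
    """Order k restriction: at most k attributes appear in the condition."""
    return lambda attrs, conf, sup: len(attrs) <= k


def conf_at_least(p):
    """Constraint conf(C) >= p."""
    return lambda attrs, conf, sup: conf >= p


def sup_at_least(s):
    """Constraint sup(C) >= s."""
    return lambda attrs, conf, sup: sup >= s


def acceptable(attrs, conf, sup, constraints):
    """A condition is returned only if it satisfies every constraint."""
    return all(c(attrs, conf, sup) for c in constraints)


# Illustrative thresholds; the figures for the two conditions come from Table 4.
constraints = [order_at_most(1), conf_at_least(0.3), sup_at_least(1000)]
exec_ok = acceptable({"occupation"}, 0.484, 1968, constraints)
selfemp_ok = acceptable({"workclass"}, 0.557, 622, constraints)
```

With these illustrative thresholds, Occupation=Exec-managerial is acceptable, whereas Workclass=Self-emp-inc is rejected because its support of 622 falls below 1000.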

The term "selection condition" will be used further herein as shorthand for a 
"tight diagnosable selection condition that satisfies all selection condition 
constraints". 

Among conditions that satisfy the selection condition constraints, those with a
relatively high combination of support and confidence are of interest. In addition,
relatively simple conditions that still have high support and confidence are of
interest. Simpler conditions are less likely to "overfit" the data, and thus in some
situations can be more useful than complicated conditions that may have higher
support and confidence. The interest in simpler conditions leads to the definition of a
simpler-than ordering.

Definition 2.8 (simpler-than ordering ≥_simpler) The existence is assumed of a
simpler-than ordering ≥_simpler, which is a reflexive and transitive relation on the set of
all selection conditions, where C_1 ≥_simpler C_2 means C_1 is no more complex than C_2.

The following example presents some sample simpler-than orderings.

Example 2.18 There are many possible simpler-than orderings.

1. ≥_simpler1: C_1 ≥_simpler1 C_2 iff every attribute occurring in C_1 also occurs in
C_2.

2. ≥_simpler2: Alternatively, C_1 ≥_simpler2 C_2 iff C_1 has the same or fewer distinct
attributes occurring in it than C_2.

3. ≥_simpler3: Alternatively, C_1 ≥_simpler3 C_2 iff C_1 has fewer distinct attributes
than C_2, or C_1 has the same number of distinct attributes as C_2 but fewer numeric
attributes.

Example 2.18 presents a set of different "simpler-than" relations. Many
others are possible as well. In the subject framework, the application developer can
select a preferred "simpler-than" relation and use it in conjunction with the rest
of the framework.

Example 2.19 (manufacturing example 1.1 revisited) Consider C_1 and C_2 from
Example 2.10, and the simpler-than orderings from the previous example, then
C_1 ≥_simpler2 C_2 and C_1 ≥_simpler3 C_2.

Example 2.20 (loan diagnosis example 1.2 revisited) Consider C_1 and C_2 from
Example 2.11, and the simpler-than orderings from Example 2.18, then C_1 ≥_simpler2
C_2, C_1 ≥_simpler3 C_2, C_2 ≥_simpler2 C_1, and C_2 ≥_simpler3 C_1.

Example 2.21 (census example 1.3 revisited) In the census example we have
(workclass=Self-emp-inc) ≥_simpler2 (7688 ≤ capital-gain ≤ 20051),
(workclass=Self-emp-inc) ≥_simpler3 (7688 ≤ capital-gain ≤ 20051), and
(7688 ≤ capital-gain ≤ 20051) ≥_simpler2 (workclass=Self-emp-inc).

Further herein, it is assumed that the application developer has selected an
arbitrary but fixed simpler-than ordering ≥_simpler.

As the next step, support, confidence, and simplicity are combined into one
ordering. This combined ordering makes it possible to state that one condition is
better than another according to any "reasonable" measure whenever it is better
according to the ordering defined next. The notion of a "reasonable"
measure will be formalized when a DDO is defined infra.

Definition 2.9 (support, confidence, simplicity ordering ⪰, ≻) Suppose that C_1,
C_2 are selection conditions. It is said that C_1 ⪰ C_2 iff conf(C_1) ≥ conf(C_2) and
sup(C_1) ≥ sup(C_2) and C_1 ≥_simpler C_2. Additionally, it is said that C_1 ≻ C_2 iff
C_1 ⪰ C_2 and it is not the case that C_2 ⪰ C_1.

Intuitively, C_1 ≻ C_2 means that C_1 is better than C_2 in either support,
confidence or simplicity and is not worse than C_2 in any of the others.
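
The ordering of Definition 2.9 can be sketched in Python as follows. The class name `Scored`, the function names, and the choice of the attribute-count ordering (the second ordering of Example 2.18) as the default are assumptions made for illustration; the confidence and support figures are those stated in Examples 2.10 and 2.11.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Scored:
    """A selection condition summarised by its confidence, support and attributes."""
    name: str
    conf: float
    sup: int
    attrs: FrozenSet[str]


def simpler2(c1: Scored, c2: Scored) -> bool:
    """c1 has the same or fewer distinct attributes than c2."""
    return len(c1.attrs) <= len(c2.attrs)


def weakly_better(c1: Scored, c2: Scored, simpler=simpler2) -> bool:
    """c1 is at least as good as c2 in confidence, support and simplicity."""
    return c1.conf >= c2.conf and c1.sup >= c2.sup and simpler(c1, c2)


def strictly_better(c1: Scored, c2: Scored, simpler=simpler2) -> bool:
    """c1 is weakly better than c2, and c2 is not weakly better than c1."""
    return weakly_better(c1, c2, simpler) and not weakly_better(c2, c1, simpler)


# Figures as stated in Example 2.10 (manufacturing) and Example 2.11 (loan).
m1 = Scored("Sensor1 = 36", 1.0, 1, frozenset({"Sensor1"}))
m2 = Scored("6 <= Sensor2 <= 7 and Sensor4 = 8", 0.2, 1, frozenset({"Sensor2", "Sensor4"}))
l1 = Scored("Dep = 1", 0.5, 2, frozenset({"Dep"}))
l2 = Scored("2000 <= Income <= 4500", 1.0, 3, frozenset({"Income"}))
```

Here `strictly_better(m1, m2)` and `strictly_better(l2, l1)` both hold. Two conditions can also be incomparable, for instance when one has the higher confidence and the other the higher support.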

Example 2.22 (manufacturing example 1.1 revisited) Consider the selection
conditions C_1, C_2 from Example 2.10 and suppose ≥_simpler2 from Example 2.18 is
selected as the simpler-than ordering, then C_1 ≻ C_2, since C_1 is better than C_2 in terms
of confidence and simplicity, and is not worse in terms of support.

Example 2.23 (loan default prediction example 1.2 revisited) Consider the
selection conditions C_1, C_2 from Example 2.11 and suppose ≥_simpler2 from Example
2.18 is again selected as the simpler-than ordering, then C_2 ≻ C_1, since C_2 is
better than C_1 in terms of support and confidence, and is not worse in terms of
simplicity.

Example 2.24 (census example 1.3 revisited) Consider the selection conditions
(maritalstatus=Married-civ-spouse), (37 ≤ age ≤ 59) and suppose ≥_simpler3 from
Example 2.18 is again selected as the simpler-than ordering, then
(maritalstatus=Married-civ-spouse) ≻ (37 ≤ age ≤ 59), since
(maritalstatus=Married-civ-spouse) is better than (37 ≤ age ≤ 59) in terms of
support, confidence, and simplicity.

An up-set will now be defined which uses the ≻ ordering to specify the set of
selection conditions which are better than a given selection condition.

Definition 2.10 (up-set) Suppose R is a relation instance over schema A, C is a
selection condition, D is an OC, S is a set of selection condition constraints. The
up-set of C w.r.t. the above parameters, denoted up(C), is {C' | C' is a selection
condition satisfying S and C' ≻ C}.

Intuitively, the up-set of C denotes the set of all selection conditions that are 
better than C in terms of the support, confidence, simplicity ordering. 

Example 2.25 (manufacturing example 1.1 revisited) Suppose D is
Inspection ≠ 0, the set of selection condition constraints is empty, and the
simplicity ordering is ≥_simpler2 from Example 2.18, then the selection condition
34 ≤ Sensor1 ≤ 36 would have an empty up-set, since no condition has better support,
confidence, or simplicity than 34 ≤ Sensor1 ≤ 36.

Example 2.26 (loan default prediction example 1.2 revisited) Suppose D is
Default = 1, the set of selection condition constraints is empty, and the simplicity ordering is
≥_simpler2 from Example 2.18, then 1 ≤ YPrevEmpl ≤ 2 has an up-set of
{2000 ≤ Income ≤ 4500}. 2000 ≤ Income ≤ 4500 is in the up-set because it has better
support and confidence than 1 ≤ YPrevEmpl ≤ 2, and is no worse in terms of
simplicity.

Example 2.27 (census example 1.3 revisited) Returning to the census data 
example, and suppose the simplicity ordering is > simple*. The up-set of the condition 
12 ≤ education-num ≤ 16 is empty, while the up-set of the condition
relationship=Wife is non-empty since (occupation=Exec-managerial) ≻
(relationship=Wife).

Further, the definition of an up-set is used in defining a set of fringes. The
goal in defining fringes is to form a series of sets of selection conditions, such that if
a selection condition C is in the i-th set, it can be guaranteed that there are at least i
conditions which are better than C, according to any "reasonable" measure. Fringes
will play an important role in ensuring that the second, third, and fourth criteria
set forth in the previous paragraphs are achieved.

Definition 2.11 (fringes) Suppose R is a relation instance over schema A, C is a
selection condition, S is a set of selection condition constraints, and A is a
semi-equivalence relation on selection conditions. The fringes F^0, F^1, ... of the data
diagnosis problem w.r.t. these definitions are as follows:

F^0 = {C | up(C) = ∅}.

F^{i+1} = {C | up(C) ⊆ ∪ up(C')}.

Fringe F^0 represents the best possible selection conditions w.r.t. the
support, confidence, simplicity ordering. Fringe F^1 represents the set of selection
conditions that are only worse according to the support, confidence, simplicity
ordering than selection conditions on F^0, and so on.
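
The layered structure described above can be sketched in Python. This sketch rests on an assumed reading of the intuitive description, namely that a condition is placed on the first fringe at which every condition in its up-set already lies on an earlier fringe; it is not offered as the exact formula of Definition 2.11. The function names are hypothetical, the figures come from Table 4, and all four conditions have a single attribute, so simplicity plays no role under an attribute-count ordering.

```python
def fringes(conds, strictly_better):
    """Layer the conditions: layer 0 has an empty up-set, and each later layer
    holds the conditions whose up-set lies entirely within earlier layers."""
    conds = list(conds)
    up = {c: {o for o in conds if strictly_better(o, c)} for c in conds}
    placed, layers = set(), []
    while len(placed) < len(conds):
        layer = {c for c in conds if c not in placed and up[c] <= placed}
        if not layer:
            raise ValueError("strictly_better is not a strict partial order")
        layers.append(layer)
        placed |= layer
    return layers


def strictly_better(a, b):
    """a is at least as good as b in confidence and support, and b is not as good as a."""
    a_ge_b = a[1] >= b[1] and a[2] >= b[2]
    b_ge_a = b[1] >= a[1] and b[2] >= a[2]
    return a_ge_b and not b_ge_a


# (name, confidence, support) for four single-attribute conditions of Table 4.
execm = ("Occupation=Exec-managerial", 0.484, 1968)
husband = ("Relationship=Husband", 0.449, 5918)
wife = ("Relationship=Wife", 0.475, 745)
prof = ("Occupation=Prof-specialty", 0.449, 1859)

layers = fringes([execm, husband, wife, prof], strictly_better)
```

The layers are computed relative to the four conditions passed in, not to the full set of tight conditions of the census data set. On this subset the first layer contains Occupation=Exec-managerial and Relationship=Husband, and the second contains Relationship=Wife and Occupation=Prof-specialty.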

Example 2.28 (manufacturing example 1.1 revisited) Suppose D is Inspection ≠ 0,
S requires all conditions to have order at most 1 and support of at least 2, and
the simplicity ordering is ≥_simpler2 from Example 2.18, then F^0 consists of
{34 ≤ Sensor1 ≤ 36, 45 ≤ Sensor3 ≤ 46}.

In the last example, by the definition of fringes, any condition which satisfies
the selection condition constraints and does not appear in F^0 must be worse in terms
of ≻ than 34 ≤ Sensor1 ≤ 36 or 45 ≤ Sensor3 ≤ 46.

Example 2.29 (loan default prediction example 1.2 revisited) Suppose D is
Default = 1, S requires all conditions to have order at most 1 and support of at least 2,
the simplicity ordering is ≥_simpler2 from Example 2.18, then F^0 is
{2000 ≤ Income ≤ 4500}, and F^1 is {1 ≤ YPresEmpl ≤ 5, 32 ≤ Age ≤ 43, 32 ≤ Age ≤ 42,
32 ≤ Age ≤ 37}.

In some situations a condition C_1 may be considered to be "better" than C_2 even though
neither C_1 ⪰ C_2 nor C_2 ⪰ C_1 holds. To this end, a general structure called a data
diagnosis objective (DDO) is presented next, which permits, if desired, ordering two
conditions, C_1 and C_2, based on any "reasonable" measure of goodness of conditions, even
if it is not the case that C_1 ⪰ C_2 or C_2 ⪰ C_1.

Definition 2.12 (data diagnosis objective (DDO)) A data diagnosis objective (DDO for
short) is a triple (ED, ⊑, f) where:

1. ED is a nonempty set called the evaluation domain, and

2. ⊑ is a partial ordering on ED, and

3. f is a mapping from selection conditions to ED such that C_1 ⪰ C_2 ⟹ f(C_2) ⊑ f(C_1).

Intuitively, a DDO specifies three components, (ED, ⊑, f). ED is a set that will
provide the "yardstick" for measuring goodness of a selection condition; ⊑ will specify
which values in ED are better than other values. Finally, f will explain how to associate a
value in ED with a selection condition. The technique of the present invention is extremely
generic and allows a wide range of DDOs to be used for diagnostic inference, including all
that are considered "reasonable" by the user.

Some samples of DDOs are presented infra.

Example 2.30 There are many possible DDOs that could be used by a diagnostic system.
Two possible DDOs of interest are provided infra, where a simplicity ordering is assumed
to have been fixed.

1. The triple can be ([0,1], ≤, a × conf(C) + b × sup(C)/sup(D)). Here, the evaluation
domain consists of real numbers in the unit interval [0,1]. The values in this domain are
ordered according to the usual less-than-or-equals ordering. f returns a linear combination
based on the confidence and support values as well as the constants a and b, which are
chosen so that a + b = 1.

2. The triple can be (R × R, ⊑, f) where:

• R × R is the set of pairs of real numbers;

• [x,y] ⊑ [x',y'] iff x ≤ x' and y ≤ y'; and

• f(C) = [conf(C), sup(C)].

As example 2.30 illustrates, f does not have to provide a total ordering for
conditions. Virtually all standard metrics for comparing selection conditions can also be
used in DDOs. Examples of standard metrics include chi-squared value, confidence,
conviction, entropy gain, Laplace, lift, gain, Gini, and support.
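
The two DDOs of Example 2.30 can be sketched in Python as follows. The sketch is illustrative: the function names are hypothetical, the normalising constant is taken to be the support of the whole data set (7,841 in the census example), and the weights a and b are assumed to be non-negative so that a better condition never receives a lower score.

```python
def linear_ddo(a: float, b: float, sup_total: int):
    """Evaluation function of the first DDO: a * conf + b * sup / sup_total,
    compared with the usual <= on the unit interval."""
    if a < 0 or b < 0 or abs(a + b - 1.0) > 1e-12:
        raise ValueError("a and b must be non-negative and sum to 1")

    def f(conf: float, sup: int) -> float:
        return a * conf + b * sup / sup_total

    return f


def pair_leq(x, y) -> bool:
    """Ordering of the second DDO: [x, y] is below [x', y'] iff x <= x' and y <= y'."""
    return x[0] <= y[0] and x[1] <= y[1]


# Figures from Table 4; 7841 is the support of the entire census data set.
f = linear_ddo(0.5, 0.5, 7841)
score_gain = f(0.992, 916)     # 7688 <= capital-gain <= 20051
score_husband = f(0.449, 5918)  # Relationship=Husband

wife_below_exec = pair_leq((0.475, 745), (0.484, 1968))
exec_vs_husband = (pair_leq((0.484, 1968), (0.449, 5918)),
                   pair_leq((0.449, 5918), (0.484, 1968)))
```

The linear DDO yields a total order on the scores, whereas the pair ordering leaves Occupation=Exec-managerial and Relationship=Husband incomparable, since neither pair lies below the other.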

It is important to recognize that the fringes are defined independently of the DDO
and therefore it is possible to apply several DDOs to a set of fringes without recomputing
the fringes. In fact, the fringe that a condition appears on can also be a convenient basis for
a DDO, as is seen in the next example.

Example 2.31 (fringe based DDO)

A DDO can be constructed based on the fringe construction: let f(C) = i where C ∈ F^i; then
the triple (N, ≥, f) defines a DDO.

Even if two selection conditions are "good" according to the selected DDO, a user
may not be interested in knowing both conditions if the user deems them to be "more or
less" the same, even though they look syntactically different. The concept of a
semi-equivalence relation is defined infra which is intended to capture this issue.

Definition 2.13 (semi-equivalence relation on selection conditions) A binary relation A
on selection conditions used to compare the similarity of conditions is said to be a
semi-equivalence relation iff A is reflexive and symmetric.

It is important to note that a semi-equivalence relation does not need to be transitive.
Intuitively, all conditions that are semi-equivalent to each other may be considered to be the
same. There are many criteria under which two conditions may be held to be
semi-equivalent. For example, for a given application, it might be said that the condition
0.0001 ≤ A ≤ 1 is semi-equivalent to the condition 0 ≤ A ≤ 1 because these conditions are
almost the same. Some examples of semi-equivalence relations are presented infra.

Example 2.32 It can be said that two diagnosable selection conditions C_1, C_2 are:

1. instance-same iff σ_{C_1}(R) = σ_{C_2}(R);

2. strongly instance-same iff σ_{C_1}(R) = σ_{C_2}(R) and the set of all attributes occurring
in C_1 equals the set of all attributes occurring in C_2.

Both the relations instance-same and strongly instance-same are semi-equivalence relations.
(In this case, they also happen to be equivalence relations, but this is not always the case.)

Example 2.33 (distance based semi-equivalence relation) Suppose there is a function d which
maps pairs of selection conditions to R ∪ {∞}; furthermore, suppose d satisfies the
following conditions:

d(C, C) = 0.
d(C_1, C_2) = d(C_2, C_1).

The function d is not required to satisfy the triangle inequality. The procedure of the
present invention groups selection conditions into clusters such that any two selection
conditions inside the same cluster lie within a threshold distance of each other. If the space
SC of all selection conditions is limited to be finite and t > 0 is a value (threshold), then
Algorithm 1 may be used to compute a semi-equivalence relation.

Algorithm 1 uses a threshold number (denoted MAXDIST) to determine when two 
conditions are sufficiently similar. 

ALGORITHM 1 



proc makebuckets
    B = ∅; % no buckets yet
    while SC ≠ ∅ do
        Pick a selection condition sc ∈ SC;
        SC = SC − {sc};
        if (∃b ∈ B)(∀sc' ∈ b) d(sc, sc') ≤ MAXDIST then
            b = b ∪ {sc};
        else
            B = B ∪ {{sc}}; % create new bucket
    end while
    return B;
end proc

At the end of Algorithm 1, a set of buckets is returned. It is said that C_1, C_2 are
semi-equivalent if they belong to the same bucket. Different choices in how sc is picked
may lead to different buckets. In order to avoid this (and perhaps to achieve optimized
"bucketing") the following procedure can be used:

Define the width of a bucket b to be max({d(sc, sc') | sc, sc' ∈ b}). Suppose P_1 ∪ ... ∪ P_r
is a partition of all selection conditions. The width of the partition is
Σ_{i=1}^{r} width(P_i). The
smaller the width of the partition, the more similar items within a bucket are deemed to be
by the distance function d. Hence, the goal is to find a partition of the set of selection
conditions that has minimal width. The following theorem 2.1 says that computing a
minimal width partition is NP-complete, as presented in C.E. Leiserson, R.L. Rivest, and C.
Stein. Introduction to Algorithms. MIT Press, 2001.

Theorem 2.1 Suppose d is a polynomially computable distance function, t > 0 is a 
threshold, and C is any set of selection conditions. Suppose P is a partition of C. Checking 
if P is a minimal width partition is NP-complete. 

The following Algorithm 2 can be used to find a partition of C which is optimal. 

ALGORITHM 2 

proc make_optimal_buckets
    OPEN = {B}; CLOSED = ∅;
    Best = NIL; BestWidth = ∞;
    while OPEN ≠ ∅ do
        Pick a partition p ∈ OPEN;
        (* generate possible new partitions by merging two buckets in p. *)
        (* insert the new partition into OPEN as long as it does not exist in *)
        (* either OPEN or CLOSED *)
        foreach b_1, b_2 ∈ p do
            if (p − {b_1, b_2} ∪ {b_1 ∪ b_2}) ∉ CLOSED ∪ OPEN then
                insert((p − {b_1, b_2} ∪ {b_1 ∪ b_2}), OPEN)
            if width(p − {b_1, b_2} ∪ {b_1 ∪ b_2}) < BestWidth then
                Best = p − {b_1, b_2} ∪ {b_1 ∪ b_2};
                BestWidth = width(p − {b_1, b_2} ∪ {b_1 ∪ b_2});
        end foreach
        OPEN = OPEN − {p};
        CLOSED = CLOSED ∪ {p};
    endwhile
    return Best;
end proc



As in the case of the makebuckets Algorithm 1, the make_optimal_buckets
Algorithm 2 may also be used to define a semi-equivalence relation. Specifically, two
selection conditions are considered to be semi-equivalent if they are in the same bucket.

In the next example 2.34 infra, a semi-equivalence relation is presented that is easy
to compute. The existence of a distance metric, d, is assumed for all numeric attributes.

Example 2.34 (attribute distance threshold semi-equivalence relation) Suppose for an
attribute, A, and a condition, C, L_{C,A} and U_{C,A} are defined to be the lower and upper
bounds, respectively, of attribute A in condition C.

It can then be said that C_1 and C_2 are semi-equivalent iff all the following hold:

1. The set of attributes appearing in C_1 is equal to the set of attributes
appearing in C_2.

2. For each numeric attribute, A, appearing in C_1 and C_2:

(a) d(L_{C_1,A}, L_{C_2,A}) ≤ ε_A
(b) d(U_{C_1,A}, U_{C_2,A}) ≤ ε_A

where the ε_A values are constants that may differ based on the attribute A.

3. For each non-numeric attribute, A, appearing in C_1 and C_2, L_{C_1,A} = L_{C_2,A}.

Example 2.35 (manufacturing example 1.1 revisited) Returning to the manufacturing
example 1.1, suppose an attribute distance threshold semi-equivalence relation with
ε_Sensor3 = 1 and d = |·|, the standard absolute value metric. Then the conditions
45 ≤ Sensor3 ≤ 46 and 44 ≤ Sensor3 ≤ 47 are semi-equivalent. However, the conditions
45 ≤ Sensor3 ≤ 46 and 45 ≤ Sensor3 ≤ 48 would not be semi-equivalent.

Example 2.36 (loan default prediction 1.2 example) Returning to the loan default
prediction example 1.2, suppose an attribute distance threshold semi-equivalence relation
with ε_Income = 500 and d = |·|, the standard absolute value metric. In this case
2000 ≤ Income ≤ 4000 and 2250 ≤ Income ≤ 4250 would be semi-equivalent, but the conditions
2000 ≤ Income ≤ 4000 and 2750 ≤ Income ≤ 4750 would not be.
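
The attribute distance threshold relation of Example 2.34 can be sketched in Python, using the absolute value metric and the threshold of Example 2.36. The dictionary representation of conditions and the function name are assumptions made for illustration.

```python
def semi_equivalent(c1, c2, eps):
    """Attribute distance threshold semi-equivalence.
    c1, c2 map attribute -> (lower, upper); eps maps numeric attribute -> threshold."""
    if set(c1) != set(c2):
        return False  # rule 1: same attribute sets
    for attr, (lo1, hi1) in c1.items():
        lo2, hi2 = c2[attr]
        if isinstance(lo1, str) or isinstance(lo2, str):
            if lo1 != lo2:
                return False  # rule 3: non-numeric values must coincide
        elif abs(lo1 - lo2) > eps[attr] or abs(hi1 - hi2) > eps[attr]:
            return False  # rule 2: both bounds within the threshold
    return True


eps = {"Income": 500}
a = {"Income": (2000, 4000)}
b = {"Income": (2250, 4250)}
c = {"Income": (2750, 4750)}
```

Here `a` and `b` are semi-equivalent, and so are `b` and `c` (each bound differs by exactly 500), yet `a` and `c` are not. This shows concretely why a semi-equivalence relation need not be transitive.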

Example 2.37 (census example 1.3 revisited) Returning to the census data example 1.3,
suppose an attribute distance threshold semi-equivalence relation with d = |·|, the standard
absolute value metric. Further suppose that ε_age = 15, ε_education-num = 2,
ε_capital-gain = 10000, ε_hours-per-week = 10; then no two conditions in the set presented
in Example 1.3 are semi-equivalent.

As in the census data example 2.37, defining an appropriate semi-equivalence relation will
provide the means to ensure that the fifth criterion for diagnostic frameworks outlined supra
will be achieved, namely that the set of conditions returned will not have any redundant
elements.

In the method of the present invention, the definition of semi-equivalence is applied
to a set of fringes in order to eliminate redundant elements. In view of this, a compact set
is to be defined, as presented infra.

Definition 2.14 (compact set) Suppose C is a selection condition, (ED, ⊑, f) is a DDO,
and A is a semi-equivalence relation on the selection conditions. A subset CF of a set of
selection conditions F is said to be a compact representation of F w.r.t. A iff:

1. For each sc ∈ F, ∃sc' ∈ CF such that sc A sc'.

2. If sc ∈ F and sc ∉ CF, then ∃sc' ∈ CF such that
sc A sc' ∧ (¬(f(sc') ⊑ f(sc)) ∨ (f(sc') = f(sc))).

3. There is no strict subset CF' of CF satisfying the above two conditions.

The first condition requires that any selection condition in the original set be
semi-equivalent to at least one selection condition in the compact set. The second condition
requires that for any selection condition that was removed from the original set, there exist a
semi-equivalent condition in the compact set that is not strictly worse than the removed
selection condition. The third condition requires that a compact set be minimal.

Some examples of compact sets are presented infra.

Example 2.38 (manufacturing example 1.1 revisited) Consider F^0 from
Example 2.28, and suppose the semi-equivalence relation is a distance threshold relation
with all ε values set to 1. Independent of the choice of DDO, the compact representation
of F^0 is still {34 ≤ Sensor1 ≤ 36, 45 ≤ Sensor3 ≤ 46}, since 34 ≤ Sensor1 ≤ 36 and
45 ≤ Sensor3 ≤ 46 are not semi-equivalent.

Example 2.39 (loan default example 1.2 revisited) Consider F^0 ∪ F^1 from Example 2.29,
where the semi-equivalence relation is a distance threshold relation with
ε_YPresAdd = ε_YPresEmpl = ε_Age = 5, ε_Income = ε_LPay = 500, ε_Dep = 1. Suppose a fringe
based DDO is chosen, then the two possible compact representations of F^0 ∪ F^1 are
{2000 ≤ Income ≤ 4500, 1 ≤ YPresEmpl ≤ 5, 32 ≤ Age ≤ 42} and {2000 ≤ Income ≤ 4500,
1 ≤ YPresEmpl ≤ 5, 32 ≤ Age ≤ 47}.

In the last example, if the DDO had ordered 32 ≤ Age ≤ 42 before 32 ≤ Age ≤ 47
or vice versa, then there would have been only one possible compact set. Also note that, in
the last example, if ε_Age < 5, the compact set would be F^0 ∪ F^1.
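
The three requirements of a compact representation can be checked mechanically on small sets. The Python sketch below is an illustration rather than the method of the invention: it models each condition as an (attribute, lower, upper) triple, takes the thresholds and the four conditions from the loan example above, assumes a fringe based DDO in which a smaller fringe index is better, and tests minimality by brute force over subsets. All function names are hypothetical.

```python
from itertools import combinations

EPS = {"Income": 500, "YPresEmpl": 5, "Age": 5}  # thresholds quoted in the loan example


def equiv(c1, c2):
    """Attribute distance threshold semi-equivalence for one-attribute conditions."""
    (a1, lo1, hi1), (a2, lo2, hi2) = c1, c2
    return a1 == a2 and abs(lo1 - lo2) <= EPS[a1] and abs(hi1 - hi2) <= EPS[a1]


def satisfies_1_and_2(cf, fset, f, leq):
    """Requirements 1 and 2 of a compact representation."""
    cond1 = all(any(equiv(sc, k) for k in cf) for sc in fset)
    cond2 = all(
        any(equiv(sc, k) and (not leq(f(k), f(sc)) or f(k) == f(sc)) for k in cf)
        for sc in fset if sc not in cf
    )
    return cond1 and cond2


def is_compact(cf, fset, f, leq):
    if not set(cf) <= set(fset) or not satisfies_1_and_2(cf, fset, f, leq):
        return False
    # Requirement 3: no strict subset may also satisfy requirements 1 and 2.
    return not any(
        satisfies_1_and_2(sub, fset, f, leq)
        for r in range(len(cf))
        for sub in combinations(cf, r)
    )


income = ("Income", 2000, 4500)
ypres = ("YPresEmpl", 1, 5)
age42 = ("Age", 32, 42)
age47 = ("Age", 32, 47)
FRINGE = {income: 0, ypres: 1, age42: 1, age47: 1}
fset = [income, ypres, age42, age47]


def worse_or_equal(x, y):
    """Fringe based DDO ordering: a larger fringe index is worse."""
    return x >= y


with_42 = is_compact([income, ypres, age42], fset, FRINGE.get, worse_or_equal)
with_47 = is_compact([income, ypres, age47], fset, FRINGE.get, worse_or_equal)
whole = is_compact(fset, fset, FRINGE.get, worse_or_equal)
```

Under these assumptions both three-element sets are compact, matching the two representations named in the example, while the full four-element set is not compact because a strict subset already satisfies the first two requirements.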

Example 2.40 (census example 1.3 revisited) Suppose ≥_simpler2 from Example 2.18 is
used as the simplicity ordering, then

{7688 ≤ capital-gain ≤ 20051, 3103 ≤ capital-gain ≤ 99999, workclass=Self-emp-inc,
occupation=Exec-managerial, 12 ≤ education-num ≤ 16, relationship=Husband,
maritalstatus=Married-civ-spouse, 26 ≤ age ≤ 75} ⊆ F^0

while

{2 ≤ education-num ≤ 16, race=White, native-country=United States, relationship=Wife,
occupation=Prof-specialty, 37 ≤ age ≤ 59, sex=Male} ⊆ F^1.

In example 2.40, only a subset of F^0 and F^1 has been presented. In particular, it
is to be noted that 2 ≤ education-num ≤ 16 and 37 ≤ age ≤ 59 appear on F^1 while no
conditions from their up-sets were returned. No conditions from their up-sets were returned,
since they were all semi-equivalent to at least one condition that was returned.

Taking the compact representation of a set of fringes makes it possible to achieve the
third, fourth, and fifth criteria set forth supra. By allowing the user to specify the number
of fringes of interest, the user has some control over the number of conditions returned, thus
achieving the second criterion. The framework presented handles
numeric and non-numeric data, and thus also meets all the criteria established for a
desirable diagnostic inference framework.

The key goal of the automated data diagnosis solved by the algorithms of the
present invention, presented in detail infra, is stated as follows:

Suppose R is a relation instance over schema A, D is an OC, (ED, ⊑, f) is a DDO,
S is a set of selection condition constraints, ≥_simpler is a simpler-than ordering, and A is a
semi-equivalence relation on selection conditions. For a given integer j ≥ 0, finding a compact
representation of F^0 ∪ ... ∪ F^j will give an optimal set of conditions on the diagnosable
attributes of the relation R for the outcome condition OC.

Example 2.41 (census example 1.3 revisited) The set presented in Example 1.3 is a
compact representation of F^0 ∪ F^1 where the outcome condition OC is Income=true,
the DDO is a fringe based DDO, the set of selection condition constraints contains just an
order 1 restriction, the simpler-than ordering is ≥_simpler2 from Example 2.18, and the
semi-equivalence relation is from Example 2.37.

Referring to Fig. 1, a system 10 for automated manufacturing data diagnosis of the
present invention includes a computer system 12, which includes input means 14 for
receiving and storing measured parameters of the product 16, an R creating block 18 for
creating (or specifying) a relation R containing the data corresponding to the input
measurement results, an interface unit 20 for interfacing the computer system 12
with the user 22, permitting the user to specify certain parameters needed for the automated
data diagnosis, and a processor 24 run by the software 26 of the present invention for
computing the optimal set of conditions for separating desired attributes of the product 16
from undesirable ones. The system 10 of the present invention also includes means 28 for
obtaining data characterizing the product 16. In the specific example, which is the
manufacturing process for fabrication of the product 16, discussed herein for the sake of
simplicity but not in order to limit the system of the present invention only to this specific
application, such data acquisition means 28 includes sensors for measuring different
characteristics of the product 16. The measured characteristics of the product 16 are then
submitted to the input 14 of the computer system 12 for further analysis.

Referring again to Fig. 1, and also to Fig. 2, which represents a flow chart diagram
of the process for automated data diagnosis of the present invention, the user 22 in the
logical block 30 "Create a Relation R Containing Data to be Diagnosed" creates a relation
R containing the data to be diagnosed. This relation is either created in the block 18 of the
system 10, best shown in Fig. 1, or alternatively, if a single relation already exists a priori,
the step may merely involve specifying what that relation R is, or may involve
eliminating irrelevant columns from that relation (which can be performed by the projection
operation well known in relational algebra, as described in J.D. Ullman, Principles
of Database and Knowledge-Base Systems, Computer Science Press, 1989). Alternatively,
the application developer may need to access multiple data sources in order to create such a
relation. The relations created in block 30 may have the form of the Tables 1, 2, or 3, or
of any other relational database applicable to different applications. For example, for the
manufacturing example 1.1, Table 1 will serve as the relation R, where the columns of the
Table represent attributes of the product, while the rows of the Table 1 represent tuples
which record information about a particular entity (item ID). In the Table 1, for each item
of the product 16 manufactured, the sensors 28 record various aspects of the product as it
moves, for example, along a production line from the raw material stage to the finished
product, or part thereof.

From the logical block 30, the procedure moves to block 32 "Identify Outcome and
Diagnose Attributes". In the Table 1, each of the sensor readings, together with the
"ItemID" attribute, corresponds to a diagnosed attribute, while the attribute "Inspection"
corresponds to the outcome attribute. At the end of the manufacturing process, quality
control inspectors inspect the resulting product and either assign it 0 (pass inspection), or a
defect code number specifying a certain kind of defect. This finding can then be
represented in the Table 1 as the outcome attribute Inspection.

From the block 32, the logic flows to the block 34 "Specify Outcome Condition D".
In this step, the user 22 specifies a defective outcome condition. The step involves
articulating a selection condition in any relational language, e.g., SQL. As presented supra
with regard to definition 2.1, an outcome condition D is any selection condition that only
involves outcome attributes. For the manufacturing example 1.1, two possible outcome
conditions that may be specified are Inspection = 14 and Inspection ≠ 0.
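
As a minimal sketch of this step, the two outcome conditions above can be written as predicates over tuples and used to split the relation into the tuples satisfying D and those satisfying its negation. The rows are hypothetical values shaped like Table 1, and the function names are illustrative assumptions rather than part of the described system.

```python
# Hypothetical rows shaped like Table 1; the values are made up for illustration.
rows = [
    {"ItemID": 1, "Sensor1": 36, "Sensor2": 6, "Inspection": 14},
    {"ItemID": 2, "Sensor1": 36, "Sensor2": 5, "Inspection": 0},
    {"ItemID": 3, "Sensor1": 40, "Sensor2": 6, "Inspection": 7},
]


def inspection_is_14(row):
    """Outcome condition D: Inspection = 14."""
    return row["Inspection"] == 14


def inspection_not_0(row):
    """Outcome condition D: Inspection != 0 (any defect code)."""
    return row["Inspection"] != 0


def split_by_outcome(rows, outcome):
    """Partition the relation into tuples satisfying D and tuples satisfying not-D."""
    d_rows = [r for r in rows if outcome(r)]
    not_d_rows = [r for r in rows if not outcome(r)]
    return d_rows, not_d_rows
```

Both predicates read only the outcome attribute, which is what makes them outcome conditions in the sense of definition 2.1.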

From block 34, the flow chart moves to the block 36 "Specify the Simpler-Than
Ordering". In this step, the user 22 may explicitly encode a function (in any programming
language) that takes two selection conditions C1, C2. The function would return "true" to
indicate that C1 is simpler than C2, or "false" to indicate otherwise. Alternatively, the
application developer could select a simpler-than ordering from a pick list of such
orderings.

The selection conditions C1 and C2 are specified in accordance with the
definitions 2.3 (Diagnosable Selection Condition) and 2.4 (Tight Diagnosable Selection
Condition). For the manufacturing example 1.1, sensor 1 = 36; sensor 2 = 6 is an
example of a tight diagnosable selection condition for an outcome condition D of
Inspection = 14. Suppose C1 is Sensor 1 = 36 and suppose C2 is 2 < Sensor 2 < 7 ∧ Sensor
4 = 8.

In the block 36, the simpler-than ordering is specified in accordance with the
definition 2.8 and example 2.18 presented supra herein. For the manufacturing example
1.1, considering C1 and C2 from example 2.10, and the simpler-than orderings from the
previous example 2.18, C1 >simpler2 C2 and C1 >simpler3 C2.
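
A user-encoded simpler-than function might look like the following sketch. The orderings of Example 2.18 are not reproduced here; the rule used below (a condition is simpler when it constrains fewer attributes) is an assumption chosen only for illustration, as is the representation of a condition as a mapping from attribute name to a (lower, upper) bound pair.

```python
def order(condition):
    """Number of attributes constrained by a condition (its order)."""
    return len(condition)


def simpler_than(c1, c2):
    """Return True when c1 is simpler than c2 under the assumed ordering."""
    return order(c1) < order(c2)


# C1 and C2 from the text, encoded as attribute -> (lower, upper) bounds.
c1 = {"Sensor1": (36, 36)}
c2 = {"Sensor2": (2, 7), "Sensor4": (8, 8)}
```

Under this assumed ordering, `simpler_than(c1, c2)` holds because C1 constrains one attribute while C2 constrains two.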

From the logical block 36, the logic moves to the block 38 "Specify the Data
Diagnosis Objective (ED, ⊑, f)". In this step, the user 22 specifies an evaluation domain ED,
a partial ordering ⊑ on the evaluation domain, and a mapping f that maps selection conditions to
the evaluation domain, satisfying the axioms of definition 2.12. Again, the user 22 may
explicitly write code to specify these components or just choose them from a list of
predetermined data diagnosis objectives.

Further, the flow chart proceeds to the logical block 40 "Specify the Semi-
Equivalence Relation", wherein the user 22 must specify the semi-equivalence relation on
selection conditions in accordance with Definition 2.13 presented supra, taking into
consideration example 2.32 and example 2.33. The semi-equivalence relation can be computed
in block 40 either by the Algorithm 1 or the Algorithm 2 presented supra herein.

The attribute distance threshold semi-equivalence relation can also be used to
specify the constraint in block 40, as presented in the example 2.34. For the manufacturing
example 1.1, the conditions 45 < sensor 3 < 46 and 44 < sensor 3 < 47 will be semi-
equivalent. By specifying the semi-equivalence relation, the procedure of the present
invention ensures that the set of conditions returned will not have any redundant elements.
The definition of semi-equivalence is applied in the method and system of the present
invention to a set of fringes to eliminate redundant elements by generating a compact set of
the optimal fringes, as will be presented infra herein.
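
The sensor 3 example can be sketched as a distance test on bounds. This is an illustrative assumption about the form of the attribute distance threshold relation, not a reproduction of Example 2.34: two conditions over the same attributes are treated as semi-equivalent when each pair of corresponding bounds differs by at most a threshold (here 1, a made-up value).

```python
def semi_equivalent(c1, c2, threshold=1):
    """Hypothetical attribute distance threshold test on range conditions.

    Conditions map attribute names to (lower, upper) bounds.
    """
    if c1.keys() != c2.keys():
        return False
    return all(
        abs(c1[a][0] - c2[a][0]) <= threshold
        and abs(c1[a][1] - c2[a][1]) <= threshold
        for a in c1
    )


narrow = {"sensor3": (45, 46)}
wide = {"sensor3": (44, 47)}
far = {"sensor3": (40, 47)}
```

With these values `narrow` and `wide` are semi-equivalent, while `narrow` and `far` are not, so only one of the first two would need to be reported.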

From the block 40, the procedure flows to block 42 "Specify the Number j + 1 of
Fringes of Interest". The definition of fringes is introduced supra by the definition 2.11, and
the example 2.28 presents fringes for the manufacturing example 1.1. By specifying
the number of fringes of interest, the user has some control over the number of
conditions returned.

Further, the flow chart proceeds to the block 44 "Specify Selection Condition
Constraints". In this step, the user has the option to specify three numbers. One number
denotes the minimal confidence that is acceptable, another denotes the minimal support that
is acceptable, and the third denotes the maximum order of a selection condition. The
selection condition constraints are specified in accordance with definition 2.7, presented supra. For the
manufacturing example 1.1, the selection condition constraint can be as in the example 2.14
supra, where the condition sensor 1 = 36 ∧ 5 < sensor 2 < 6 would satisfy an order 2
restriction. The confidence is specified by the definition 2.5 supra; the support is
specified in accordance with definition 2.6 supra, as presented in example 2.10 with regard
to the manufacturing example 1.1. Among conditions that satisfy the selection constraints,
the conditions of interest are those with a relatively high combination of support and
confidence. In addition, relatively simple conditions that still have high
support and confidence are of interest. Simpler conditions are less likely to "overfit" the data and thus, in
some situations, can be more useful than complicated conditions that may have higher
support and confidence. The interest in simpler conditions motivates the use of the
simpler-than ordering specified in the logical block 36.
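
The three user-supplied numbers can be checked as in the sketch below. It follows the way Algorithm #3 fills the vertex fields (support is the count numD of matching tuples that satisfy D, and confidence is numD / (numD + numNotD)); the function and parameter names are assumptions, and definitions 2.5 to 2.7 remain the authoritative statements.

```python
def confidence(num_d, num_not_d):
    """Fraction of tuples matching a condition that also satisfy D."""
    total = num_d + num_not_d
    return num_d / total if total else 0.0


def satisfy_constraints(order, num_d, num_not_d,
                        min_support, min_confidence, max_order):
    """Check the minimal support, minimal confidence and maximum order."""
    return (
        order <= max_order
        and num_d >= min_support
        and confidence(num_d, num_not_d) >= min_confidence
    )
```

For instance, an order 2 condition matched by 8 defective and 2 non-defective items has confidence 0.8; it passes with a minimal support of 5, a minimal confidence of 0.75 and a maximum order of 2, but fails once the maximum order is lowered to 1.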

From the block 44, the logic proceeds to block 46 "Compute optimal fringes F^0,
F^1, ..." in accordance with the definition 2.11 and as will be described in detail infra.

From the block 46, the procedure flows to the block 48 "Compute a Compact
Representation CF of the Optimal Fringes", which will be presented in more detail in further
paragraphs. After the block 48, the procedure ends, as the optimal set of conditions on the
diagnosable attributes of the relation R for the outcome condition D has been formed.

The computation of the optimal fringes (logical block 46) for the data diagnosis
problem, and the computation of a compact representation of the optimal fringes (logical block
48), will be presented in detail in the following paragraphs. The system 10 uses a special
data structure defining a condition graph (referred to further herein as COG) that has been
designed as part of the technique of the present invention.

Assume that R is a relation instance over schema A created in block 30 of Fig. 2, D is an OC
as specified in block 34, (ED, ⊑, f) is a DDO as specified in block 38, S is a set of
selection condition constraints specified in block 44, ~ is a semi-equivalence relation on
selection conditions specified in block 40, j + 1 is the number of fringes of interest specified
in block 42, and >simpler is the simplicity ordering specified in block 36. All these parameters are
fixed by the user to whatever value is appropriate for the particular problem domain.

Condition Graph 

A condition graph is comprised of a set of vertices V and a set of edges E. Suppose
T is the set of tight selection conditions satisfying the selection condition constraints S. Let
T' be the subset of those selection conditions that appear on one of the first j + 1 fringes. Each
vertex has a condition field which stores a selection condition. Let φ map a
condition C to a vertex v for which v.condition = C. The set of vertices V can then be
defined as {φ(C) | C ∈ T'}.

The selection condition that a vertex corresponds to is stored in the condition field
of the vertex. The set of edges, E, is defined as:

E = {(u, v) | u, v ∈ V ∧ (u.condition ≻ v.condition) ∧ ¬∃w ∈ V s.t. (u.condition ≻ w.condition ≻ v.condition)}

The level of a vertex v in the graph will correspond to the fringe of v.condition, where:

1. level(v) = 0 if there is no other vertex v' such that (v', v) ∈ E.

2. level(v) = MAX{level(v') + 1 | (v', v) ∈ E} otherwise.
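
The two rules above amount to a longest-path computation from the roots. A minimal sketch follows, assuming the edges are given as a hypothetical mapping from each vertex name to the list of its parents; the COG itself maintains levels incrementally rather than recomputing them this way.

```python
def compute_levels(parents):
    """Compute level(v) for every vertex of an acyclic parent map.

    parents maps each vertex to the list of vertices v' with (v', v) in E.
    """
    levels = {}

    def level(v):
        if v not in levels:
            ps = parents.get(v, [])
            # Rule 1: no incoming edge means level 0.
            # Rule 2: otherwise one more than the deepest parent.
            levels[v] = 0 if not ps else max(level(p) + 1 for p in ps)
        return levels[v]

    for v in parents:
        level(v)
    return levels
```

A vertex reachable both from a root directly and through a longer chain takes the larger value, which matches the MAX in rule 2.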

In addition to a condition field, a vertex v will also have the following fields:

1. level which contains level(v)

2. support which contains sup(v.condition)

3. confidence which contains conf(v.condition)

4. parents which contains {u | (u, v) ∈ E}

5. children which contains {u | (v, u) ∈ E}.

A vertex with level(v) = 0 is called a root. All root vertices will be linked together
via a linked list. The nodes of the linked list will contain two fields:

1. vertex which contains a root vertex

2. next which is a pointer to another root node, or the value nil if it is the last node in
the list.
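
The vertex fields and the root list can be pictured with the following sketch. The class names, the use of Python lists for the parents and children sets, and the representation of a condition as a dictionary are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # vertices are compared by identity, not by field values
class Vertex:
    condition: dict
    level: int = 0
    support: int = 0
    confidence: float = 0.0
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


@dataclass
class ListNode:
    vertex: Vertex                      # a root vertex
    next: Optional["ListNode"] = None   # next root node, or None for nil


def roots(root_list):
    """Walk the linked list of roots and yield each root vertex."""
    node = root_list
    while node is not None:
        yield node.vertex
        node = node.next
```

Each vertex gets its own parents and children lists through `default_factory`, so links added to one vertex never leak into another.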

Algorithm #3 "Build_COG" is used to build COG according to the parameters
specified in blocks 34-44 of Fig. 2.

ALGORITHM #3
proc Build_COG (R, A, D, S, j, >simpler)
(* The procedure generates all tight selection conditions satisfying S, and evaluates their
support and confidence *)
(* For each condition generated satisfying S a call is made to Insert_COG *)
RootList = nil; (* Initially there are no vertices in the graph *)
for each combination, φ, of attributes in A that do not violate S do
    (* The BuildDataStructure calls build data structures to store the projection on
    the attributes φ of the tuples that satisfy D and ¬D respectively *)
    DPoints = BuildDataStructure (R, A, φ, D);
    NotDPoints = BuildDataStructure (R, A, φ, ¬D);
    boundsActive = ∅; (* The set contains the bounds of the condition currently being
    enumerated *)
    initstack (boundsStack); (* a stack of bounds to try in the current search direction *)
    boundsToInstantiate = {L < ϑ | ϑ ∈ φ} ∪ {ϑ < U | ϑ ∈ φ};
    (* To establish a condition a lower and upper bound must be instantiated for each
    attribute in φ *)
    pick a p ∈ boundsToInstantiate;
    boundsToInstantiate = boundsToInstantiate − {p};
    valueSet = RangeQuery (boundsActive, DPoints, p);
    (* RangeQuery finds all tuples in DPoints satisfying boundsActive, and *)
    (* returns the projection of the tuples on the attribute appearing in p *)
    push (boundsStack, (valueSet, p));
    while not emptyStack (boundsStack) do
        boundPair = pop (boundsStack); (* a boundPair contains two fields, p and valueSet *)
        (* p is an uninstantiated bound, and valueSet is a set of possible
        instantiations for the bound *)
        if (boundPair.valueSet ≠ nil) then
            pick an r ∈ boundPair.valueSet;
            boundPair.valueSet = boundPair.valueSet − {r};
            boundsActive = boundsActive ∪ Instantiate (boundPair.p, r);
            (* Instantiate simply instantiates the bound p with the value r *)
            push (boundsStack, boundPair);
            if (boundsToInstantiate = nil) then
                (* A completely instantiated condition has been generated *)
                numD = CountQuery (boundsActive, DPoints);
                numNotD = CountQuery (boundsActive, NotDPoints);
                (* CountQuery takes a set of bounds and a set of tuples, *)
                (* and returns the number of tuples satisfying the bounds *)
                if (SatisfyConstraints (S, boundsActive, numD, numNotD)) then
                    (* SatisfyConstraints returns true iff the condition defined by
                    boundsActive satisfies S *)
                    v = new vertex(); (* allocates storage for a new vertex *)
                    v.condition = boundsActive; (* the condition defined by boundsActive *)
                    v.support = numD; v.confidence = (numD)/(numD + numNotD);
                    v.level = 0; v.children = nil; v.parents = nil;
                    RootList = Insert_COG (v, RootList, j, >simpler);
                end if;
            else (* we have more bounds that need to be instantiated *)
                pick a p ∈ boundsToInstantiate;
                boundsToInstantiate = boundsToInstantiate − {p};
                valueSet = RangeQuery (boundsActive, DPoints, p);
                push (boundsStack, (valueSet, p));
            end if;
        else (* we are done with boundPair.p for now *)
            boundsToInstantiate = boundsToInstantiate ∪ {boundPair.p};
        end if;
    end while;
end for;
return RootList;
end proc;



Algorithm #3, which is a part of the overall framework of the present invention, is
used to build a COG according to the parameters specified in blocks 34-44 of Fig. 2.
Referring to Fig. 3, which is a flow chart diagram of the Algorithm #3, the procedure in
block 50 "Enumerate Set of Tight Selection Conditions T" will first enumerate all tight
selection conditions T satisfying the selection condition constraints S specified in block 44
of Fig. 2. From the block 50, the procedure flows to block 52 "Evaluate the Support and
Confidence of Each Tight Selection Condition", where the program evaluates the support
and confidence of each tight selection condition of the set T. Further, the flow chart
moves to block 54 "Does the Tight Selection Condition Satisfy Selection Condition
Constraints S?" Those conditions that satisfy the selection condition constraints will then
be inserted into the COG via a call to Insert_COG in block 56 "Insert the tight selection
condition into condition graph", which will be discussed in detail in further paragraphs. If
the condition fails to satisfy S, it is discarded in the block 58.

The Algorithm #3 makes use of standard operations on a stack, such as initstack,
push, pop, and emptyStack. These operations are described in D.E. Knuth, The Art of
Computer Programming, Vol. 1, Fundamental Algorithms, Addison Wesley.

Any data structure which supports range and count queries over point data can be
used to store DPoints and NotDPoints, which are built by the BuildDataStructure calls of the
Algorithm #3. However, a k-d tree is a data structure which is particularly effective for range
and count queries for data of arbitrary dimensions, as described in H. Samet, The Design and
Analysis of Spatial Data Structures, Addison-Wesley, 1990.
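
To make the two query types concrete, the sketch below answers them by a linear scan over a plain list of points. It is a brute-force stand-in, not a k-d tree, and the inclusive bound semantics, function names and sample points are assumptions for illustration.

```python
def matches(point, bounds):
    """True when the point lies within every (lower, upper) bound, inclusive."""
    return all(lo <= point[a] <= hi for a, (lo, hi) in bounds.items())


def count_query(bounds, points):
    """Number of points satisfying the bounds (cf. CountQuery)."""
    return sum(1 for p in points if matches(p, bounds))


def range_query(bounds, points, attribute):
    """Values of one attribute among points satisfying the bounds (cf. RangeQuery)."""
    return {p[attribute] for p in points if matches(p, bounds)}


# Hypothetical projection of the tuples satisfying D onto two attributes.
d_points = [
    {"Sensor1": 36, "Sensor2": 6},
    {"Sensor1": 36, "Sensor2": 5},
    {"Sensor1": 40, "Sensor2": 6},
]
```

A spatial index would return the same answers while avoiding the full scan on every call, which matters because Build_COG issues these queries once per enumerated bound.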

It is important to note that the number of attribute combinations that the Build_COG
Algorithm #3 must consider is polynomial in the number of attributes if an order constraint
is provided, but exponential if no such constraint is provided.
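
The difference can be checked with a short count. This sketch assumes that a combination is any non-empty subset of the attributes whose size does not exceed the order constraint; the function name is hypothetical.

```python
from math import comb


def num_attribute_combinations(n_attrs, max_order=None):
    """Count non-empty attribute subsets, optionally capped at max_order."""
    if max_order is None:
        # No order constraint: every non-empty subset must be considered.
        return 2 ** n_attrs - 1
    top = min(max_order, n_attrs)
    return sum(comb(n_attrs, k) for k in range(1, top + 1))
```

With 10 attributes this gives 10 combinations for an order constraint of 1, 55 for an order constraint of 2, and 1023 when no constraint is given.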

The Algorithm #4 "Insert_COG", presented infra, which runs in block 56 of Fig. 3, is
used to insert conditions into the COG. The procedure Insert_COG invokes three
procedures, InsertDownwards, AddChildLinks, and UpdateLevelsAndEliminate, that will
be presented infra herein.

ALGORITHM #4
proc Insert_COG (v, RootList, j, >simpler)
(* v is the vertex to insert, RootList points to the first root of the COG root list, and *)
(* j is the maximum number of fringes in the COG. >simpler is the simplicity ordering.
Procedure inserts v into the COG. *)
if (RootList = nil) then
    (* COG is empty, v is the first vertex inserted into the COG; new ListNode() allocates
    memory. *)
    RootList = new ListNode(); RootList.vertex = v; RootList.next = nil;
else
    CurrRootPtr = RootList; BackRootPtr = nil;
    InsertedVertex = false; InsertedAsRoot = false;
    Visited = { };
    while (CurrRootPtr ≠ nil) do (* Traversing list of roots *)
        if (CurrRootPtr.vertex.condition ≻ v.condition) then
            (* result of the ≻ evaluation is dependent on the >simpler parameter *)
            (* InsertDownwards will make v a descendant of CurrRootPtr.vertex *)
            Visited = InsertDownwards (Visited, v, CurrRootPtr.vertex, j, >simpler);
            BackRootPtr = CurrRootPtr; CurrRootPtr = CurrRootPtr.next;
            InsertedVertex = true;
        else if (v.condition ≻ CurrRootPtr.vertex.condition) then
            (* v needs to become the parent of CurrRootPtr.vertex *)
            CurrRootPtr.vertex.parents = CurrRootPtr.vertex.parents ∪ {v};
            v.children = v.children ∪ {CurrRootPtr.vertex};
            (* UpdateLevelsAndEliminate updates the levels of CurrRootPtr.vertex and
            descendants *)
            UpdateLevelsAndEliminate (CurrRootPtr.vertex, v.level + 1, j);
            if (¬InsertedAsRoot) then (* add vertex to root list, remove old root *)
                InsertedAsRoot = true; InsertedVertex = true;
                TempRootPtr = new ListNode();
                TempRootPtr.next = CurrRootPtr.next; TempRootPtr.vertex = v;
                if (BackRootPtr = nil) then
                    RootList = TempRootPtr;
                else
                    BackRootPtr.next = TempRootPtr;
                end if;
                BackRootPtr = TempRootPtr; CurrRootPtr = BackRootPtr.next;
            else (* just remove the old root *)
                BackRootPtr.next = CurrRootPtr.next; CurrRootPtr = BackRootPtr.next;
            end if;
        else
            (* v is not a descendant of CurrRootPtr.vertex, but still may be the parent *)
            (* of descendants of CurrRootPtr.vertex; calling AddChildLinks to find children
            of v *)
            Visited = AddChildLinks (Visited, v, CurrRootPtr.vertex, j, >simpler);
            BackRootPtr = CurrRootPtr; CurrRootPtr = CurrRootPtr.next;
        end if;
    end while;
    if (¬InsertedVertex) then
        (* vertex has not been inserted, adding to the front of the root list *)
        TempRootPtr = new ListNode();
        TempRootPtr.vertex = v; TempRootPtr.next = RootList;
        RootList = TempRootPtr;
    end if;
end if;
return RootList;
end proc;

The procedure InsertDownwards, presented as the Algorithm #5, takes 5 parameters.
The first parameter, "Visited", is a set of vertices already visited, used to
avoid visiting the same vertex multiple times; v is the vertex to insert, and "parentVertex" is
a vertex which is known to be better than v, but of which v is not yet a descendant. The
parameter j is the maximum fringe number specified by the user. The last parameter,
>simpler, is the simpler-than ordering. The procedure either calls AddChildLinks and
InsertDownwards recursively, or adds the appropriate child and parent links directly.

ALGORITHM #5
proc InsertDownwards (Visited, v, parentVertex, j, >simpler)
(* Visited is the set of vertices already visited, v is the vertex to insert, *)
(* parentVertex is a vertex better than v, j is the maximum fringe number, and >simpler is the
simpler-than ordering *)
if (parentVertex ∉ Visited) then
    Visited = Visited ∪ {parentVertex};
    InsertedVertex = false; AddedLinkToParent = false;
    for each u ∈ parentVertex.children do
        if (u.condition ≻ v.condition) then
            (* v also needs to be a descendant of u, calling InsertDownwards recursively *)
            Visited = InsertDownwards (Visited, v, u, j, >simpler);
            InsertedVertex = true;
        else if (v.condition ≻ u.condition) then (* v will be an ancestor of u *)
            if (∀w ∈ u.parents, w.condition ≻ v.condition) then
                (* v will be a parent of u, if not already *)
                u.parents = u.parents − {parentVertex};
                parentVertex.children = parentVertex.children ∪ {v};
                AddedLinkToParent = true;
                UpdateLevelsAndEliminate (v, parentVertex.level + 1, j);
            end if;
            if (u ∉ Visited) then (* connect u to v *)
                Visited = Visited ∪ {u};
                u.parents = u.parents ∪ {v};
                v.children = v.children ∪ {u};
                UpdateLevelsAndEliminate (u, v.level + 1, j);
            end if;
            InsertedVertex = true;
        else (* v may still be the parent of descendants of u *)
            Visited = AddChildLinks (Visited, v, u, j, >simpler);
        end if;
    end for;
    if (¬InsertedVertex) then (* make v the child of parentVertex *)
        v.parents = v.parents ∪ {parentVertex};
        parentVertex.children = parentVertex.children ∪ {v};
    end if;
end if;
return Visited;
end proc;



The procedure AddChildLinks takes a vertex v being inserted and a
"parentVertex". Here v is not a descendant of "parentVertex"; however, v may be the parent of
descendants of "parentVertex". If v is better than a child of "parentVertex", then the
appropriate links are added. The procedure AddChildLinks is presented as the Algorithm
#6.

ALGORITHM #6
proc AddChildLinks (Visited, v, parentVertex, j, >simpler)
(* Visited is the set of vertices already visited, v is the new vertex being inserted, *)
(* parentVertex is not better than v, j is the maximum fringe number, and >simpler is the
simpler-than ordering *)
for each u ∈ parentVertex.children s.t. u ∉ Visited do
    Visited = Visited ∪ {u};
    if (v.condition ≻ u.condition) then (* v is better than u *)
        if (∀w ∈ u.parents, w.condition ≻ v.condition) then (* v is the parent of u *)
            u.parents = u.parents ∪ {v};
            v.children = v.children ∪ {u};
        end if;
    else (* v is not better than u, call AddChildLinks recursively *)
        Visited = AddChildLinks (Visited, v, u, j, >simpler);
    end if;
end for;
return Visited;
end proc;

The next procedure, UpdateLevelsAndEliminate, is used to update the level of the
vertices in the graph COG, and to remove vertices that fall below the jth level. This
procedure UpdateLevelsAndEliminate is presented as the Algorithm #7.

ALGORITHM #7
proc UpdateLevelsAndEliminate (v, LevelLowerBound, j)
(* The vertex v cannot have a level any lower than LevelLowerBound and may need to be
updated *)
(* If the level of v is greater than j it needs to be removed *)
oldlevel = v.level;
v.level = MAX (v.level, LevelLowerBound);
if (v.level > j) then
    (* v will be removed from the graph, by removing all links to it from its parents *)
    for each u ∈ v.parents do
        u.children = u.children − {v};
    end for;
else if (v.level > oldlevel) then (* level of v has changed *)
    (* recursively call UpdateLevelsAndEliminate on all descendants of v *)
    for each u ∈ v.children do
        UpdateLevelsAndEliminate (u, v.level + 1, j);
    end for;
end if;
end proc;
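
The level propagation can be traced with a small runnable sketch. It assumes, in line with the comment in the pseudocode, that a vertex whose level exceeds j is detached by removing it from the children of its parents; the class and helper names are hypothetical.

```python
class Vertex:
    def __init__(self, name, level=0):
        self.name = name
        self.level = level
        self.parents = []
        self.children = []


def link(parent, child):
    """Add the edge (parent, child) to the graph."""
    parent.children.append(child)
    child.parents.append(parent)


def update_levels_and_eliminate(v, level_lower_bound, j):
    old_level = v.level
    v.level = max(v.level, level_lower_bound)
    if v.level > j:
        # v fell below the last fringe of interest: unlink it from its parents.
        for p in v.parents:
            p.children = [c for c in p.children if c is not v]
    elif v.level > old_level:
        # The level changed, so the children may have to move down as well.
        for u in list(v.children):
            update_levels_and_eliminate(u, v.level + 1, j)
```

For a chain a, b, c at levels 0, 1, 2 with j = 2, placing a new root above a and calling `update_levels_and_eliminate(a, 1, 2)` moves a and b to levels 1 and 2 and detaches c, whose level would become 3.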

After computing the optimal fringes F^0, ..., F^j in block 46 of Fig. 2, as presented by
the flow chart of Fig. 3 and the Algorithms #3 - #7, a compact representation of these
fringes is generated in block 48 of Fig. 2 by invoking the Compact_Fringes Algorithm
#8:

ALGORITHM #8
proc Compact_Fringes (RootPtr, (ED, ⊑, f), ~)
(* The procedure returns the set of vertices in the COG pointed to by RootPtr after removing
semi-equivalent conditions; ~ is the semi-equivalence relation. *)
CompactSet = ∅;
V = Traverse_COG (RootPtr); (* Procedure returns all the vertices in the COG *)
OrderedConditions = TotalOrderConditions (V, (ED, ⊑, f), ~);
(* TotalOrderConditions returns a total ordering of the *)
(* conditions stored in V, that is consistent with the ⊑ of the DDO *)
while OrderedConditions ≠ ∅ do
    Let C be the first condition in OrderedConditions;
    OrderedConditions = OrderedConditions − {C};
    if (∀x ∈ CompactSet, ¬(C ~ x)) then
        CompactSet = CompactSet ∪ {C};
    end if;
end while;
return CompactSet;
end proc;
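
The filtering loop of Algorithm #8 reduces to a few lines once the conditions are already totally ordered. In this sketch the semi-equivalence relation is passed in as a function; the `within_one` relation and the sample ranges are made-up stand-ins used only to exercise the loop.

```python
def compact_fringes(ordered_conditions, semi_equivalent):
    """Keep a condition only if it is not semi-equivalent to one already kept."""
    compact = []
    for c in ordered_conditions:
        if all(not semi_equivalent(c, x) for x in compact):
            compact.append(c)
    return compact


def within_one(c1, c2):
    """Hypothetical relation on (lower, upper) ranges: bounds differ by at most 1."""
    return abs(c1[0] - c2[0]) <= 1 and abs(c1[1] - c2[1]) <= 1


ordered = [(45, 46), (44, 47), (10, 12)]
```

Because earlier conditions come first in the total ordering, `compact_fringes(ordered, within_one)` keeps (45, 46), drops the semi-equivalent (44, 47), and keeps (10, 12).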



Algorithm #8 uses the subroutine TotalOrderConditions (V, (ED, ⊑, f), ~) that
returns a total ordering of the conditions stored in V. The total ordering returned must be
consistent with the partial ordering ⊑ used in the DDO. The subroutine
TotalOrderConditions may be implemented as a solution of the well-known problem of
topological sorting, for which many algorithms exist, as for instance presented in D.E. Knuth,
"The Art of Computer Programming", Vol. 1, Fundamental Algorithms.
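
One way to obtain such a total ordering is a standard topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); it assumes the partial ordering is supplied as hypothetical (better, worse) pairs of condition names, which is an illustrative encoding rather than the one used by the described system.

```python
from graphlib import TopologicalSorter


def total_order_conditions(conditions, better_pairs):
    """Return the conditions in an order where every better one comes first."""
    sorter = TopologicalSorter({c: set() for c in conditions})
    for better, worse in better_pairs:
        # Declare that `better` must precede `worse` in the result.
        sorter.add(worse, better)
    return list(sorter.static_order())
```

Conditions that are incomparable under the partial ordering may appear in either relative order, which is acceptable since only consistency with the partial ordering is required.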

The Algorithm #8 also uses the subroutine Traverse_COG (Algorithm #9) to 
traverse the COG pointed to by RootPtr, and returns all the vertices in the COG. 

ALGORITHM #9
proc Traverse_COG (RootPtr)
(* Procedure returns all vertices in the COG pointed to by RootPtr *)
(* The procedure traverses the root list, calling TraverseDown_COG for each root *)
V = ∅;
while (RootPtr ≠ nil) do
    V = V ∪ TraverseDown_COG (RootPtr.vertex, V);
    RootPtr = RootPtr.next;
end while;
return V;
end proc;



The sub-routine TraverseDown_COG of Traverse_COG is presented as
Algorithm #10.

ALGORITHM #10
proc TraverseDown_COG (v, Visited)
(* Procedure adds all vertices that are descendants of v, and not already in Visited, to
Visited *)
if (v ∉ Visited) then
    Visited = Visited ∪ {v};
    for each u ∈ v.children do
        Visited = TraverseDown_COG (u, Visited);
    end for;
end if;
return Visited;
end proc;
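
Algorithms #9 and #10 together form a depth-first traversal from every root. A compact sketch follows, assuming the graph is given as a hypothetical mapping from each vertex name to its list of children and the root list as a plain Python list.

```python
def traverse_down(v, children, visited):
    """Add v and all of its not yet visited descendants to visited."""
    if v not in visited:
        visited.add(v)
        for u in children.get(v, ()):
            traverse_down(u, children, visited)
    return visited


def traverse_cog(roots, children):
    """Collect every vertex reachable from any root."""
    visited = set()
    for r in roots:
        traverse_down(r, children, visited)
    return visited
```

Sharing one visited set across all roots is what keeps a vertex with several ancestors from being expanded more than once.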



The execution time of the Algorithm #3 (Build_COG) depends primarily on the
number of conditions that must be explicitly enumerated in Block 50 of Fig. 3. The number
of conditions that must be explicitly enumerated depends on the number of combinations of
attributes that must be considered and, for each combination of attributes, on the number
of tight conditions for that combination. The number of combinations of attributes that must
be considered depends on the number of attributes in the data set, as well as on the order
restriction on the selection conditions, if any.

For a given combination of attributes, the number of tight conditions will depend on 
the number of tuples, percentage of tuples that satisfy the outcome condition, and 
characteristics of the attributes. For example numeric attributes will often be the source of 
more tight conditions than non-numeric attributes, and attributes with more values in their 
domains will also often be the source of more tight conditions. 

The number of calls that Build_COG will actually make to the procedure 
Insert_COG in Block 56 of Fig. 3 is equivalent to the number of tight selection conditions 
that satisfy the selection condition constraints. The execution time of Insert_COG and all 
the procedures called as the result of its execution will depend primarily on the number of 
conditions present in the COG. The number of conditions present in the COG will depend 
on the simplicity order and number of fringes chosen. 

Some experimental results, conducted on the Census Income data set (Table 3) using a
Pentium 933 MHz machine with 384 Megabytes of RAM, are presented infra. In the
experiments, the semi-equivalence relation used was that of Example 2.37, and the
simplicity ordering was >simpler2 from Example 2.18. Unless stated otherwise, the number of
fringes found was 2, and the order constraint was 1.

The plot of Fig. 4 shows execution time versus the order constraint. Since having
an order constraint of three or greater can be extremely expensive computationally, the test
was run on a 1000 tuple sample of the data set, and for this experiment only, age and
hours-per-week values were rounded to the nearest multiple of 10. The plot of Fig. 4 indicates
exponential growth in execution time as the order constraint increases. Fortunately, in
many real-life situations, conditions of low order are of more interest than conditions of
higher order.

The plot of Fig. 5 shows the execution time of the procedure versus the number of
attributes in the data set. The experiments were run on the unmodified 32,561 tuple data
set. For control purposes, only non-numeric attributes were included in the experiment. As
expected, the increase in execution time was linear in the number of attributes, since the
order constraint was 1.

The plot of Fig. 6 shows the results from an experiment conducted on five different
10,000 tuple samples of the data set. The percentage of tuples that satisfied the outcome
condition was varied. In many application domains, such as manufacturing, the percentage
of tuples that satisfy the outcome condition of interest will generally be quite small, thus
reducing execution time.

The plot of Fig. 7 shows the growth in execution time as the number of tuples in the
data set was increased. The increase in execution time as the number of tuples increases can
be partially explained by the increase in the number of tight conditions. Also, an increase in
the number of tuples causes the evaluation of the support and confidence of conditions to
take longer.

The graph of Fig. 8 illustrates execution time versus the number of fringes. The plot
of Fig. 8 shows that increasing the number of fringes will generally increase execution time,
though the increase is relatively small.

Although this invention has been described in connection with specific forms and 
embodiments thereof, it will be appreciated that various modifications other than those 
discussed above may be resorted to without departing from the spirit or scope of the 
invention as defined in the appended Claims. For example, equivalent elements may be 
substituted for those specifically shown and described, certain features may be used 
independently of other features, and in certain cases, particular locations of elements may 
be reversed or interposed, all without departing from the spirit or scope of the invention as 
defined in the appended Claims. 